2020-09-04 Self-managed users unable to login using 2fa after a security release
Unable to login using 2fa
After a security release, customers using LDAP reported that they could no longer log in to their instances when users had 2FA enabled. This affected a large number of customers and was tracked in gitlab-org/gitlab#244638 (closed).
The root cause is a difference in the precision of the `updated_at` timestamp between what is stored in PostgreSQL and what is stored in memory. A good resource for understanding what happens with timestamps in Ruby is https://www.toptal.com/ruby-on-rails/timestamp-truncation-rails-activerecord-tale. This specifically affected LDAP sign-in because of an additional access check that 'touches' the user object.
A rough synopsis of the sequence of events causing failure:
- User signs in with the LDAP form on the sign-in page.
- GitLab performs a secondary access check and updates the `last_credential_check_at` timestamp (https://gitlab.com/gitlab-org/gitlab/-/blob/5f75c6f5184dcf50f4895e2baecc871ebdc657a9/lib/gitlab/auth/ldap/access.rb#L26). This causes the user object to be touched. The result is that a more precise timestamp is written to memory (`user.updated_at`), but PostgreSQL truncates the timestamp value to a less precise value.
- The more precise in-memory value is written to `session[:user_updated_at]` for comparison in the next request (after the user enters their 2FA OTP).
- User enters their 2FA OTP and continues.
- GitLab loads the user object from the database, with the now truncated timestamp value. We compare this to the more precise `session[:user_updated_at]` value and the comparison fails.
- GitLab fails sign-in and returns user to the sign-in page.
See gitlab-org/gitlab#244638 (comment 406763536) for more details about the root cause.
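The sequence above can be reproduced in plain Ruby without Rails or PostgreSQL. The following is only a minimal sketch under stated assumptions: `simulate_db_round_trip` is a hypothetical stand-in for writing to and reloading from the database, it assumes the stored value keeps microseconds only, and the `session` hash is an ordinary Ruby hash rather than the real Rails session. It is not GitLab's actual code.

```ruby
# Hypothetical stand-in for a database round trip. Assumption: the stored
# column keeps microseconds only, so any nanosecond digits are dropped.
def simulate_db_round_trip(time)
  Time.at(time.to_i, time.usec, :usec).utc
end

# First request: the touched user object holds a nanosecond-precision value.
in_memory = Time.at(1_599_100_000, 123_456_789, :nsec).utc

# The precise in-memory value is remembered for the next request.
session = { user_updated_at: in_memory }

# Second request: the user is reloaded, so the value is now truncated.
reloaded = simulate_db_round_trip(in_memory)

puts in_memory.nsec                         # 123456789
puts reloaded.nsec                          # 123456000
puts session[:user_updated_at] == reloaded  # false
```

Both values describe the same second and the same microsecond, yet the equality check fails because `Time#==` also compares the sub-microsecond digits.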
All times UTC.
- 00:20 Corrected security release is published and available to self-managed users: gitlab-com/www-gitlab-com!61514 (merged) and https://gitlab.slack.com/archives/C0139MAV672/p1599083641126500
- 00:42 User raises an issue about their inability to log in: gitlab-org/gitlab#244638 (closed)
- 08:30 User posts on Twitter https://twitter.com/benoitmortier1/status/1301437487722704896
- 08:48 @katrinleinweber posts in #g_manage_access about the rising number of customer support requests https://gitlab.slack.com/archives/CLM1D8QR0/p1599122899153300
- 09:21 @mksionek reacts with 👀 and starts to investigate the bug with @cat. @manojmj is involved as well.
- 13:22 @cat submits an MR proposal gitlab-org/gitlab!41327 (merged)
- 13:33 @mksionek suggests a change but generally agrees with the proposal gitlab-org/gitlab!41327 (comment 406789713)
- 13:34 @cat asks for backporting the proposal https://gitlab.slack.com/archives/CCFV016SV/p1599140082137000
- 14:36 MR assigned to maintainer for review
- 14:46 @dblessing asks for Appsec approval
- 17:41 Maintainer approves the MR
- 01:00 @rchan-gitlab from Appsec approves
- 01:47 @dblessing merges the MR and the fix is ready
- 01:49 MR is picked into all backport preparation branches
- 03:09 Preparation MR is merged into backport branches gitlab-org/gitlab!41341 (merged)
- 08:02 @marin checks the failing builds in backport branches, which turn out to be flaky failures, and retries them: https://dev.gitlab.org/gitlab/gitlab-ee/-/pipelines/167982, https://dev.gitlab.org/gitlab/gitlab-ee/-/pipelines/167978, https://dev.gitlab.org/gitlab/omnibus-gitlab/-/pipelines/167970, https://dev.gitlab.org/gitlab/gitlabhq/-/pipelines/167979
- 08:36 @marin initiates the tagging process https://gitlab.slack.com/archives/C0139MAV672/p1599208569180800
- 10:14 Deployment to the release environment is completed and automated QA is executed https://gitlab.slack.com/archives/C8PKBH3M5/p1599214450033900
- 10:56 Blog post is created and ready for review gitlab-com/www-gitlab-com!61710 (merged)
- 12:14 All packages finished building and are being published https://gitlab.slack.com/archives/C8PKBH3M5/p1599221571034500
- 13:20 Blog post is merged
- 13:54 Blog post is published and available https://about.gitlab.com/releases/2020/09/04/gitlab-13-3-5-released/
- 14:02 Release issues closed, completing the release, e.g. gitlab-org/release/tasks#1595 (closed)
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
- Who was impacted by this incident? (i.e. external customers, internal customers)
- Self-managed customers using LDAP and 2FA with GitLab versions 13.3.3, 13.2.7, 13.1.9
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- Users signing in with LDAP and 2FA enabled were unable to sign in. They were returned to the sign-in page after entering their 2FA one-time code.
- How many customers were affected?
- No precise number is available. We received at least 13 support tickets in Zendesk (found with a simple text search; the real number is likely higher).
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- An influx of customer tickets reporting the same problem after updating, together with community messages and an issue in which multiple users commented that they were experiencing the problem.
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- After a synchronous call, we found that the problem was OS-related: it was not present on macOS but was present on a Linux machine. From that realisation, we set breakpoints and inspected the code in different places until we found where the problem occurred.
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- Based on user reports, we were able to reproduce the problem in the GDK with a local LDAP server. After reproducing the issue (with some discussion in a Slack thread), we found that the cause was a security fix, so reverting it was not an option. We debugged the changes it introduced and finally diagnosed the obscure Rails/PostgreSQL behavior.
- How could time-to-diagnosis be improved?
- This was a fairly obscure Rails/PostgreSQL behavior that was difficult to diagnose. The biggest improvement we should take away from this issue concerns the overall fix and release process for a bug or regression. Once we had diagnosed the issue, it took too long to review, merge and release the fix.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- Code change - security fix https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/793
- Self-managed instances using LDAP and 2FA were unable to sign in. Why?
- Security bug fix suffered an operating-system-dependent Rails vs. PostgreSQL timestamp truncation and comparison bug.
- Why did this bug not get noticed in manual testing?
- LDAP sign-in flow was not tested manually.
- Even if it had been, the bug may not have been caught because macOS only sometimes experiences the truncation problem (about 25% of the time?). See https://www.toptal.com/ruby-on-rails/timestamp-truncation-rails-activerecord-tale for details about the behavior.
- Why did this bug not get caught by automated testing?
- Specific tests for LDAP weren't written. It's possible automated testing wouldn't have caught the issue, either.
- Why did it take approximately 38 hours to release a fix?
- The issue took approximately 5 hours from initial user report to diagnosing. The nature of the bug contributed to this.
- The issue lacked a clear owner responsible for pinging the right people for review, merge and release.
- The release process was unclear to many.
- The process for releasing a fix is more clear now.
- We should identify a clear owner and hand-off process to ensure the issue moves forward in a timely manner.
- Be careful comparing timestamps. Use `.to_i` or a specific precision to prevent the problem we saw.
- Anything related to sign-in should be tested not only with GitLab (local) sign-in, but LDAP and OmniAuth, too.
- Ensure process documentation is updated to indicate that an owner should be identified to drive the fix to completion.
- Ensure process documentation is easily accessible and known to developers.
- gitlab-org/release/docs#43 - Consolidate docs from handbook and release docs project into one location. Cross-link as needed.
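The lesson above about comparing timestamps with `.to_i` or a specific precision could look roughly like the sketch below. The helper names `same_second?` and `same_microsecond?` are hypothetical and do not come from the GitLab codebase, and the example values are made up for illustration.

```ruby
# Compare at whole-second precision, as the `.to_i` suggestion implies.
def same_second?(a, b)
  a.to_i == b.to_i
end

# Compare at microsecond precision by ignoring any finer digits.
def same_microsecond?(a, b)
  [a.to_i, a.usec] == [b.to_i, b.usec]
end

# A nanosecond-precision value and the same instant cut to microseconds.
precise   = Time.at(1_599_100_000, 123_456_789, :nsec).utc
truncated = Time.at(1_599_100_000, 123_456, :usec).utc

puts precise == truncated                   # false
puts same_second?(precise, truncated)       # true
puts same_microsecond?(precise, truncated)  # true
```

Comparing at whole-second precision is the simplest option, but it cannot distinguish two updates made within the same second. Comparing at a fixed sub-second precision avoids that trade-off as long as the chosen precision is no finer than what the database stores.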