Incident Review: OIDC/OAuth Errors
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- Users who were "signing in with GitLab" with an OIDC flow (for example through Vault)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- Because these errors would have happened on the client side, we are not sure how many customers were affected
- Here is a collection of tickets
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- About 64,737 requests to the `/discovery/keys` endpoint would not have succeeded, though it is unclear how many of these were using the response to validate tokens (see the verification sketch below).
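For context on how that endpoint is consumed, here is a minimal sketch (in Python, using PyJWT) of the client-side verification path that was failing: a relying party such as Vault or a CI job fetches GitLab's published signing keys and uses them to verify an issued token. The JWKS URL and audience value are illustrative placeholders, not values from any affected customer.

```python
import jwt  # PyJWT

JWKS_URL = "https://gitlab.example.com/oauth/discovery/keys"  # placeholder URL

def verify_gitlab_token(token: str, audience: str) -> dict:
    """Verify a GitLab-issued JWT against the keys published at JWKS_URL."""
    # Fetch the signing keys and select the one whose `kid` matches the token
    # header; this is the step that breaks when the JWKS endpoint is
    # unavailable or serves keys that do not match the issued tokens.
    jwks_client = jwt.PyJWKClient(JWKS_URL)
    signing_key = jwks_client.get_signing_key_from_jwt(token)

    # Raises jwt.InvalidTokenError (e.g. InvalidSignatureError) on failure.
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=audience,
    )
```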
- What were the root causes?
Incident Response Analysis
- How was the incident detected?
- During the investigation of a separate incident, users reported that JWTs generated via the OIDC flow and used in their Vault or CI setups were failing to verify against the JWKS issued by GitLab.
- How could detection time be improved?
- End-to-end testing of GitLab as an OIDC provider, or synthetic monitoring of the JWKS and token issuance endpoints to catch incorrectly generated tokens/keys, would have detected the problem in advance (see the monitoring sketch at the end of this section).
- How was the root cause diagnosed?
- Support posted in the Auth Slack channel about an influx of issues. @tachyons-gitlab identified that the symptoms pointed to token generation and noted that the doorkeeper gem had been updated recently. Checking the comments on the PR, we realized this had been called out in a review comment as a potential breaking change.
- How could time to diagnosis be improved?
- It took about 30 minutes to diagnose the root cause after declaring the incident
- How did we reach the point where we knew how to mitigate the impact?
- We tested a revert MR of the original upgrade
- How could time to mitigation be improved?
- According to the timeline:
- Incident declared 18:18 (6 hours 18 minutes after deploy)
- Revert MR opened 18:44 (26 minutes later)
- Revert MR merged 19:39 (55 minutes later)
- Revert MR deployed 00:36 (4 hours 57 minutes later)
- Mitigated 01:59 (1 hour 23 minutes later)
- The bulk of this time was spent before the incident was declared, and during steps 3-5 (merging, deploying, and confirming mitigation of the revert)
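As a rough illustration of the synthetic monitoring suggested above, the sketch below (Python, using requests and PyJWT) polls a JWKS endpoint and fails if the published keys are missing or unparseable. The URL is an assumed placeholder; a fuller check would also mint a token through the OIDC flow and verify it against these keys, which requires credentials and is omitted here.

```python
import sys

import jwt  # PyJWT
import requests

JWKS_URL = "https://gitlab.example.com/oauth/discovery/keys"  # placeholder URL

def check_jwks() -> int:
    """Return 0 if the JWKS endpoint serves at least one parseable key."""
    resp = requests.get(JWKS_URL, timeout=10)
    if resp.status_code != 200:
        print(f"JWKS endpoint returned HTTP {resp.status_code}")
        return 1

    keys = resp.json().get("keys", [])
    if not keys:
        print("JWKS document contains no keys")
        return 1

    # Parsing each key catches malformed or incorrectly generated key material.
    for key in keys:
        try:
            jwt.PyJWK.from_dict(key)
        except Exception as exc:  # any parsing error marks the check as failed
            print(f"Unparseable key {key.get('kid')}: {exc}")
            return 1

    print(f"OK: {len(keys)} signing key(s) published")
    return 0

if __name__ == "__main__":
    sys.exit(check_jwks())
```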
Post Incident Analysis
- Did we have other events in the past with the same root cause?
- ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Yes, scheduled for this release too! gitlab-org/quality/testcases#3984 (closed)
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
What went well?
- ...