Incident Review: OIDC/OAuth Errors
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- Users who were "signing in with GitLab" with an OIDC flow (for example through Vault)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- Because these errors would have happened on the client side, we are not sure how many customers were affected
- Here is a collection of tickets
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- About 64,737 requests to the `/discovery/keys` endpoint would not have succeeded, though it is unclear how many of these were using the response to validate tokens (see the verification sketch below).
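For context on how that endpoint is consumed, here is a minimal sketch (in Python, using PyJWT) of the client-side verification path that was failing: a relying party such as Vault or a CI job fetches GitLab's published signing keys and uses them to verify an issued token. The JWKS URL and audience value are illustrative placeholders, not values from any affected customer.

```python
import jwt  # PyJWT

JWKS_URL = "https://gitlab.example.com/oauth/discovery/keys"  # placeholder URL

def verify_gitlab_token(token: str, audience: str) -> dict:
    """Verify a GitLab-issued JWT against the keys published at JWKS_URL."""
    # Fetch the signing keys and select the one whose `kid` matches the token
    # header; this is the step that breaks when the JWKS endpoint is
    # unavailable or serves keys that do not match the issued tokens.
    jwks_client = jwt.PyJWKClient(JWKS_URL)
    signing_key = jwks_client.get_signing_key_from_jwt(token)

    # Raises jwt.InvalidTokenError (e.g. InvalidSignatureError) on failure.
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=audience,
    )
```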
- What were the root causes?
Incident Response Analysis
- How was the incident detected?
- During the investigation of a separate incident, users reported that JWTs generated via the OIDC flow and used in their Vault or CI setups were failing to verify against the JWKS issued by GitLab.
- How could detection time be improved?
- End-to-end testing of GitLab as an OIDC provider, or synthetic monitoring of the JWKS and token issuance endpoints to catch incorrectly generated tokens/keys, would have detected the problem in advance (see the monitoring sketch at the end of this section).
- How was the root cause diagnosed?
- Support posted in the Auth Slack channel about an influx of issues. @tachyons-gitlab identified that the symptoms pointed to token generation and noted that the doorkeeper gem had been updated recently. Checking the comments on the PR, we realized this had been called out in a review comment as a potential breaking change.
- How could time to diagnosis be improved?
- It took about 30 minutes to diagnose the root cause after declaring the incident
- How did we reach the point where we knew how to mitigate the impact?
- We tested a revert MR of the original upgrade
- How could time to mitigation be improved?
- According to the timeline:
- Incident declared 18:18 (6 hours 18 minutes after deploy)
- Revert MR opened 18:44 (26 minutes later)
- Revert MR merged 19:39 (55 minutes later)
- Revert MR deployed 00:36 (4 hours 57 minutes later)
- Mitigated 01:59 (1 hour 23 minutes later)
- The bulk of this time was spent before the incident was declared, and during steps 3-5 (merging, deploying, and confirming mitigation of the revert)
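As a rough illustration of the synthetic monitoring suggested above, the sketch below (Python, using requests and PyJWT) polls a JWKS endpoint and fails if the published keys are missing or unparseable. The URL is an assumed placeholder; a fuller check would also mint a token through the OIDC flow and verify it against these keys, which requires credentials and is omitted here.

```python
import sys

import jwt  # PyJWT
import requests

JWKS_URL = "https://gitlab.example.com/oauth/discovery/keys"  # placeholder URL

def check_jwks() -> int:
    """Return 0 if the JWKS endpoint serves at least one parseable key."""
    resp = requests.get(JWKS_URL, timeout=10)
    if resp.status_code != 200:
        print(f"JWKS endpoint returned HTTP {resp.status_code}")
        return 1

    keys = resp.json().get("keys", [])
    if not keys:
        print("JWKS document contains no keys")
        return 1

    # Parsing each key catches malformed or incorrectly generated key material.
    for key in keys:
        try:
            jwt.PyJWK.from_dict(key)
        except Exception as exc:  # any parsing error marks the check as failed
            print(f"Unparseable key {key.get('kid')}: {exc}")
            return 1

    print(f"OK: {len(keys)} signing key(s) published")
    return 0

if __name__ == "__main__":
    sys.exit(check_jwks())
```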
Post Incident Analysis
- Did we have other events in the past with the same root cause?
- ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Yes, scheduled for this release too! gitlab-org/quality/testcases#3984 (closed)
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
What went well?
- ...