Incident Review: MRs stuck with 500 GraphQL errors
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All users using approvals on merge requests with the reset approvals on push feature.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - MRs could not be merged because they were stuck in a broken state, as illustrated in gitlab-org/gitlab#404947 (comment 1343346997).
- How many customers were affected?
  - During the incident, the error logs showed errors from approximately 750 customers over a 24-hour period.
  - There were some support tickets from customers: #8608 (comment 1331093221).
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
What were the root causes?
- An unhandled edge case and an unhandled mergeability check reason introduced in https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3043.
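To illustrate the failure class (not GitLab's actual code — the constant and method names below are hypothetical), an unhandled check reason typically looks like a lookup that raises when a newly introduced reason has no corresponding handler, which then surfaces as a 500 in the GraphQL response:

```ruby
# Hypothetical sketch: a mergeability check returns a reason symbol,
# and the presentation layer maps it to a user-facing message.
CHECK_MESSAGES = {
  conflict: 'Merge conflicts must be resolved.',
  draft_status: 'Draft merge requests cannot be merged.'
}.freeze

# Brittle version: a reason added by a later change (e.g. one tied to
# resetting approvals on push) has no entry, so this raises KeyError
# and the request fails with a 500.
def mergeability_message!(reason)
  CHECK_MESSAGES.fetch(reason)
end

# Defensive version: unknown reasons fall back to a generic message,
# so new check reasons degrade gracefully instead of breaking the MR.
def mergeability_message(reason)
  CHECK_MESSAGES.fetch(reason, 'Merge request cannot be merged.')
end
```

The defensive fallback would have turned the hard failure into a degraded-but-usable message while the new reason's handling was being added.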
Incident Response Analysis
- How was the incident detected?
  - An issue was reported on the #g_create_code-review channel that support was seeing an increase in errors related to the changes made.
  - It was then reported to the #releases channel to ask how we could fix it, since it was a security fix.
  - The release manager asked to open an incident, so incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/8606 was created.
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - The root cause was easy to find because the errors pointed directly at it; we were able to identify that it was caused by https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3043.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - At first, we decided to fix the GraphQL error, since it seemed to be a transient issue (resolved in 10s after approvals are reset).
  - Then, upon getting more reports, we learned that some MRs were getting stuck and not recovering at all.
  - An edge case was found.
- How could time to mitigation be improved?
  - If we had a more acceptable workaround (the workaround was to disable the reset approvals feature), we could have recommended it to customers/users so they would not have needed to wait for the revert MR to be deployed.
  - While creating the revert MR was quick, we could perhaps reduce the time to get it deployed.
  - There were also some questions raised during the incident about what needs to be done to get a fix to production from the security repo (the process is a bit different compared to fixes going into the canonical repo).
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - This was triggered by the changes in https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/3043.
What went well?
- We were able to find the root cause in a timely manner because of clear error messages
- We were able to create a revert MR quickly
- We had coverage over multiple time zones
Edited by Matt Nohr