Incident Review: Groups inaccessible and 500 errors
Incident: #14469 (closed)
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - GitLab.com customers were impacted by this incident.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers were unable to access groups and projects under their top-level namespaces, via both the web UI and the API.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Here is an estimate based on the ticket reported in Zendesk: #14469 (comment 1397899967).
What were the root causes?
- A database migration left the Rails application with an inconsistent cached database schema: the in-process schema cache no longer matched the actual database, so requests failed with 500 errors. Restarting the Web/API servers cleared the cache and resolved the issue. Please see #14469 (comment 1397859217) for more details; a sketch of the failure mode follows.
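For context, here is a minimal Ruby sketch of this failure mode, assuming a Rails console and a hypothetical `Namespace` model (the exact model and column involved are not identified in this issue):

```ruby
# Hedged illustration only -- not the exact model or column from this
# incident. Assumes a Rails application context (ActiveRecord loaded).
# Rails keeps schema information in an in-process cache, so a Puma
# worker can keep building queries from a schema that no longer
# matches the database:
Namespace.column_names
# => column list from the cached (inconsistent) schema

# Queries built from that cache then fail at the database, e.g.:
#   ActiveRecord::StatementInvalid (PG::UndefinedColumn ...)

# Flushing the connection's schema cache and the model's column cache
# forces a re-read from the database -- effectively what restarting
# the Web/API servers accomplished across the whole fleet:
ActiveRecord::Base.connection.schema_cache.clear!
Namespace.reset_column_information
```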
Incident Response Analysis
- How was the incident detected?
  - SLO error alerts were received for both Web and API.
  - The issue was first reported by GitLab.com customers.
- How could detection time be improved?
  - By examining our mixed deployment environment and its tests, we hope to catch issues like this at the deployment stage, before they reach customers.
- How was the root cause diagnosed?
  - @mayra-cabrera pointed out the likely root cause, as it resembled an incident that had happened before.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - Once root cause analysis identified the stale schema cache, the mitigation (restarting the Web/API servers) followed directly.
- How could time to mitigation be improved?
  - We could invalidate the schema cache automatically when we encounter errors like this: gitlab-org/gitlab#412980 (closed). A sketch of this idea follows this list.
  - Automate restarting Puma: delivery#19301
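As a non-authoritative sketch of the first item above (not the actual implementation behind gitlab-org/gitlab#412980), a centralized error handler could flush cached schema state whenever a query fails in a way that suggests the cache is stale. `StaleSchemaCacheRecovery` and the `UndefinedColumn` message check are hypothetical names and heuristics:

```ruby
# Hypothetical helper, not GitLab's actual fix: flush cached schema
# state when an error suggests the in-process schema cache is stale,
# so subsequent requests reload the schema instead of waiting for a
# process restart. Assumes a Rails context (ActiveRecord loaded).
module StaleSchemaCacheRecovery
  def self.handle(error)
    # PG::UndefinedColumn surfaces as StatementInvalid; matching on
    # the message is an illustrative heuristic, not GitLab's logic.
    return unless error.is_a?(ActiveRecord::StatementInvalid)
    return unless error.message.include?('UndefinedColumn')

    # Clear the schema cache on every pooled connection...
    ActiveRecord::Base.connection_pool.connections.each do |conn|
      conn.schema_cache.clear!
    end
    # ...and drop per-model column caches so they rebuild lazily.
    ActiveRecord::Base.descendants.each(&:reset_column_information)
  end
end

# A centralized rescue point (e.g. Rack middleware or an error
# subscriber) would call StaleSchemaCacheRecovery.handle(exception).
```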
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes: #5171 (closed)
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Examine our mixed deployment environment and tests: #14470 (comment 1409259493)
  - Application-side improvements: gitlab-org/gitlab#412824 (closed)
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
What went well?
- We were able to relate this incident to previous ones and quickly identified a mitigation.
- The teams worked hard to identify the post-incident action items.