Incident Review: Groups inaccessible and 500 errors
Incident: #14469 (closed)
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - GitLab.com customers were impacted by this incident.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers were unable to access groups and projects under their top-level namespaces, via both the web UI and the API.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Here is an estimate based on the ticket reported in Zendesk: #14469 (comment 1397899967).
What were the root causes?
- A database migration left the Rails application with an inconsistent cached database schema: the in-process schema cache no longer matched the actual database, so requests failed with 500 errors. Restarting the Web/API servers cleared the cache and resolved the issue. Please see #14469 (comment 1397859217) for more details; a sketch of the failure mode follows.
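For context, here is a minimal Ruby sketch of this failure mode, assuming a Rails console and a hypothetical `Namespace` model (the exact model and column involved are not identified in this issue):

```ruby
# Hedged illustration only -- not the exact model or column from this
# incident. Assumes a Rails application context (ActiveRecord loaded).
# Rails keeps schema information in an in-process cache, so a Puma
# worker can keep building queries from a schema that no longer
# matches the database:
Namespace.column_names
# => column list from the cached (inconsistent) schema

# Queries built from that cache then fail at the database, e.g.:
#   ActiveRecord::StatementInvalid (PG::UndefinedColumn ...)

# Flushing the connection's schema cache and the model's column cache
# forces a re-read from the database -- effectively what restarting
# the Web/API servers accomplished across the whole fleet:
ActiveRecord::Base.connection.schema_cache.clear!
Namespace.reset_column_information
```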
Incident Response Analysis
- How was the incident detected?
  - SLO error alerts were received for both Web and API.
  - The issue was first reported by GitLab.com customers.
- How could detection time be improved?
  - By examining our mixed deployment environment and its tests, we hope to catch issues like this at the deployment stage, before they reach customers.
- How was the root cause diagnosed?
  - @mayra-cabrera pointed out the likely root cause, as it resembled an incident that had happened before.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - Once root cause analysis identified the stale schema cache, the mitigation (restarting the Web/API servers) followed directly.
- How could time to mitigation be improved?
  - We could invalidate the schema cache automatically when we encounter errors like this: gitlab-org/gitlab#412980 (closed). A sketch of this idea follows this list.
  - Automate restarting Puma: delivery#19301
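As a non-authoritative sketch of the first item above (not the actual implementation behind gitlab-org/gitlab#412980), a centralized error handler could flush cached schema state whenever a query fails in a way that suggests the cache is stale. `StaleSchemaCacheRecovery` and the `UndefinedColumn` message check are hypothetical names and heuristics:

```ruby
# Hypothetical helper, not GitLab's actual fix: flush cached schema
# state when an error suggests the in-process schema cache is stale,
# so subsequent requests reload the schema instead of waiting for a
# process restart. Assumes a Rails context (ActiveRecord loaded).
module StaleSchemaCacheRecovery
  def self.handle(error)
    # PG::UndefinedColumn surfaces as StatementInvalid; matching on
    # the message is an illustrative heuristic, not GitLab's logic.
    return unless error.is_a?(ActiveRecord::StatementInvalid)
    return unless error.message.include?('UndefinedColumn')

    # Clear the schema cache on every pooled connection...
    ActiveRecord::Base.connection_pool.connections.each do |conn|
      conn.schema_cache.clear!
    end
    # ...and drop per-model column caches so they rebuild lazily.
    ActiveRecord::Base.descendants.each(&:reset_column_information)
  end
end

# A centralized rescue point (e.g. Rack middleware or an error
# subscriber) would call StaleSchemaCacheRecovery.handle(exception).
```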
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes: #5171 (closed)
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Examine our mixed deployment environment and tests: #14470 (comment 1409259493)
  - Application-side improvements: gitlab-org/gitlab#412824 (closed)
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
What went well?
- We were able to relate this incident to previous ones and quickly identified a mitigation.
- The teams worked hard to identify the post-incident action items.