Incident Review: GitLab.com is down

Key Information

Metric	Value
Customers Affected	All
Requests Affected	All
Incident Severity	severity1
Start Time	5:00 UTC
End Time	5:20 UTC
Total Duration	20 minutes
Link to Incident Issue	#18490 (closed)

Summary

GitLab.com was completely inaccessible for several minutes because the Postgres database cluster had been shut down. Customers trying to use the site would have seen 503 errors.

Details

The database was mistakenly shut down during the incident call for incident #18489 (closed), while we were discussing whether the recently created patroni-main-v16 cluster (CR #18454 (closed), for the v16 Upgrade project) could be causing the issue. The two events coincided in time: the new cluster was created as a standby cluster of patroni-main-v14 close to when we started to observe degradation in the Sidekiq workers.

Even though we found that no workload was being redirected to the new v16 standby cluster, we decided to stop the new v16 clusters to reduce the scope of the RCA, as we were struggling to find the actual cause of the issue - https://gitlab.slack.com/archives/C07L9QWEZJ4/p1725338904349739

The command mistakenly executed at the time was:

knife ssh "role:gprd-base-db-patroni-main-v14" "sudo service patroni stop"

The command should have targeted "role:gprd-base-db-patroni-main-v16" instead.

It took us a few minutes to notice the mistake and run the command again to start the Patroni service back up on the v14 cluster.

Outcomes/Corrective Actions

Learning Opportunities

What can be improved?

Adopt an alternative method or tool to replace knife ssh that allows cluster-level OS actions but requires confirmation and peer review, specifically for requests that disrupt or stop services or perform other dangerous actions (e.g. file cleanups through rm) on multiple nodes.
...
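
As a rough illustration of the confirmation part of the improvement above, here is a minimal sketch of a guard around knife ssh. It is not an existing GitLab tool: the function names (guarded_knife_ssh, is_disruptive), the KNIFE_BIN variable, and the keyword heuristic are all hypothetical, and peer review is not covered. For commands that look disruptive, the operator has to retype the exact search query before anything runs.

```shell
#!/usr/bin/env bash
# Hypothetical guard around `knife ssh`. All names are illustrative assumptions.

# Binary used to reach the nodes. Overridable so the guard can be dry-run,
# for example with KNIFE_BIN=echo.
KNIFE_BIN="${KNIFE_BIN:-knife}"

# Returns 0 when the remote command looks disruptive. This is a coarse keyword
# heuristic (assumption); a false positive only costs an extra confirmation.
is_disruptive() {
  case "$1" in
    *" stop"*|*" restart"*|*"rm "*|*reboot*|*shutdown*) return 0 ;;
    *) return 1 ;;
  esac
}

# guarded_knife_ssh QUERY COMMAND
# For disruptive commands, require the operator to retype QUERY exactly.
guarded_knife_ssh() {
  local query="$1" remote_cmd="$2" typed
  if is_disruptive "$remote_cmd"; then
    echo "About to run '$remote_cmd' on every node matching '$query'." >&2
    printf 'Retype the query to confirm: ' >&2
    read -r typed || typed=""
    if [ "$typed" != "$query" ]; then
      echo "Confirmation did not match; nothing was executed." >&2
      return 1
    fi
  fi
  "$KNIFE_BIN" ssh "$query" "$remote_cmd"
}
```

A call would look like guarded_knife_ssh "role:gprd-base-db-patroni-main-v16" "sudo service patroni stop". Setting KNIFE_BIN=echo prints the command that would be run instead of executing it. Retyping the query only forces a second look at the target; it would not replace review by a second person.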

What went well?

Our observability metrics and alerts quickly showed that the service was shut down
The team on the call quickly figured out that the problem was a service shutdown

What was difficult?

Review Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

For the person opening the Incident Review

Set the title to Incident Review: (Incident issue name)
Assign a Service::* label (most likely matching the one on the incident issue)
Set a Severity::* label which matches the incident
In the Key Information section, make sure to include a link to the incident issue
Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
Announce the incident review in the incident channel on Slack.

:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.

For the assigned DRI

Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
Create a few short sentences in the Summary section summarizing what happened (TL;DR)
Use the description section to write a few paragraphs explaining what happened
Link any corrective actions and describe any other actions or outcomes from the incident
Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
Add any appropriate labels based on the incident issue and discussions
Once discussion wraps up in the comments, summarize any takeaways in the details section
Close the review before the due date

Edited Sep 10, 2024 by Rafael Henchen