Incident Review: 2023-10-30 GitLab.com is down

Incident issue: #17054 (closed)

Incident Review

The DRI for the incident review is the issue assignee.

  • If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
  • If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
  • Fill out relevant sections below or link to the meeting review notes that cover these topics: https://docs.google.com/document/d/1jrX-Z2NJrNjBBcywY7emQKwaKRqVAlDRdGG0Krk76ys/edit#

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. All GitLab.com users.
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. The primary database for GitLab.com was saturated due to a spike in bulk import jobs. This caused intermittent disruptions in accessing GitLab.com, including the web UI, API, and Git operations.
  3. How many customers were affected?
    1. Customers trying to access GitLab.com between 2023-10-30 15:27 UTC and 16:15 UTC, a duration of 48 minutes of service disruption.
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. All GitLab.com users were impacted.

What were the root causes?

A detailed root cause discussion is available here.

The root cause was database saturation on the merge_requests table caused by bulk import jobs. #17054 (comment 1627154963). Bulk import support had been introduced a few months earlier, and a feature flag was flipped to enable that functionality. This caused:

  • A high UPDATE rate on merge_requests caused the frequent queries against that table to become progressively less efficient, saturating the DB connection pool.
    • Replication lag exceeded the configured staleness tolerance, which put additional demand on the primary DB's connection pool because most read-only queries were no longer being offloaded to the replica DBs (see the diagnostic sketch below).
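
As an illustration of how these two signals could be checked directly, here is a minimal diagnostic sketch against the PostgreSQL statistics views. It assumes direct read access to the primary, with a placeholder DSN; it is not taken from the incident tooling.

```python
# Minimal diagnostic sketch; the DSN is a placeholder, not a real connection string.
import psycopg2

PRIMARY_DSN = "postgresql://gitlab@patroni-main-primary/gitlabhq_production"  # placeholder
HOT_TABLES = ("merge_requests", "merge_request_diff_commits")

with psycopg2.connect(PRIMARY_DSN) as conn:
    with conn.cursor() as cur:
        # Dead-tuple growth: a rising n_dead_tup with a stale last_autovacuum is
        # the bloat signal behind the progressively less efficient queries.
        cur.execute(
            """
            SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            WHERE relname IN %s
            """,
            (HOT_TABLES,),
        )
        for relname, live, dead, last_autovacuum in cur.fetchall():
            print(f"{relname}: live={live} dead={dead} last_autovacuum={last_autovacuum}")

        # Replication lag as seen from the primary: once replay_lag exceeds the
        # configured staleness tolerance, read-only traffic falls back to the primary.
        cur.execute("SELECT application_name, replay_lag FROM pg_stat_replication")
        for name, lag in cur.fetchall():
            print(f"replica {name}: replay_lag={lag}")
```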

Incident Response Analysis

  1. How was the incident detected?
    1. We noticed a drop in RPS (requests per second) and DB saturation. Initially, we received alerts about an Apdex SLO drop for all front-end components. #17054 (comment 1627147834)
  2. How could detection time be improved?
    1. Looking at the incident timeline and the PagerDuty alert, the engineer on call (EOC) received an alert at 15:27 UTC and declared an incident at that same moment, so detection was effectively immediate.
  3. How was the root cause diagnosed?
    1. After some investigation we noticed that the patroni-main primary node was experiencing high CPU load. We also saw table bloat and a dead-tuple increase on the following tables: merge_requests and merge_request_diff_commits. After reviewing the database activity we identified a set of queries with INSERT/UPDATE statements against the merge_requests table that were all related to a specific correlation ID. This helped us identify that the problem was related to a bulk import and the BulkImports::PipelineBatchWorker class (see the diagnosis sketch after this list). #17054 (comment 1627147834)
  4. How could time to diagnosis be improved?
    1. In terms of the Direct Transfer/Bulk Import feature, it would be helpful to track which bulk imports are active at any given time. If a similar interruption of service occurs and we can see that a large number of bulk import jobs have been created, that could provide a clue that bulk import caused the interruption (see the diagnosis sketch after this list).
    2. Grafana dashboards for individual services should contain links to their related runbooks. This would make it easier for the EOC to refer to them, without having to search the whole repo.
  5. How did we reach the point where we knew how to mitigate the impact?
    1. After identifying that the increased load on the merge_requests table was being caused by bulk imports, we temporarily disabled import functionality for all of GitLab.com via the Admin setting. This prevented the creation of new import jobs and gave us time to continue investigating. We then identified a specific bulk import job as the culprit and also disabled the bulk_imports_batched_import_export feature via its feature flag to prevent another occurrence (see the mitigation sketch after this list). Once the feature flag was disabled, we re-enabled the import functionality, with imports running as individual jobs rather than batched jobs.
  6. How could time to mitigation be improved?
    1. Once the diagnosis was made, the time to mitigation was good. It could have been improved by having a PM from the Import and Integrate team on the Zoom call to make relevant decisions faster, but communication was handled asynchronously in Slack and there were no inordinate delays. I also think time to mitigation should not be rushed, as mitigations themselves can have adverse customer impact.
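
A minimal sketch of the diagnosis and detection ideas above (items 3 and 4), assuming direct read access to the primary's statistics views; the DSN and the bulk_imports column names are placeholders/assumptions rather than details recorded in the incident:

```python
# Illustrative sketch only; the DSN is a placeholder and the bulk_imports column
# names (status, created_at) are assumptions, not values taken from the incident.
import psycopg2

PRIMARY_DSN = "postgresql://gitlab@patroni-main-primary/gitlabhq_production"  # placeholder

with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
    # Who is writing to merge_requests right now? The comment that GitLab appends
    # to its SQL (application, correlation_id, endpoint) is what allows writes to
    # be tied back to a correlation ID and a worker such as
    # BulkImports::PipelineBatchWorker.
    cur.execute(
        """
        SELECT pid, state, left(query, 200) AS query_head
        FROM pg_stat_activity
        WHERE query ILIKE '%merge_requests%'
          AND query ~* '^(insert|update)'
        ORDER BY query_start
        """
    )
    for pid, state, query_head in cur.fetchall():
        print(pid, state, query_head)

    # How many bulk imports were created recently, broken down by status? A sudden
    # spike here is the kind of clue item 4 asks for.
    cur.execute(
        """
        SELECT status, count(*)
        FROM bulk_imports
        WHERE created_at > now() - interval '1 hour'
        GROUP BY status
        ORDER BY count(*) DESC
        """
    )
    for status, count in cur.fetchall():
        print(f"bulk_imports status={status}: {count}")
```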
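
For the mitigation steps in item 5, both levers can in principle be driven through the GitLab REST API. This is a hedged sketch with a placeholder base URL and token; the bulk_import_enabled application setting named here is an assumption, not a detail from the incident notes:

```python
# Hedged mitigation sketch; the base URL, token, and the bulk_import_enabled
# setting name are placeholders/assumptions, not taken from the incident.
import requests

GITLAB_URL = "https://gitlab.example.com"      # placeholder
HEADERS = {"PRIVATE-TOKEN": "<admin-token>"}   # placeholder admin token

# Disable the batched-export behaviour by turning its feature flag off.
requests.post(
    f"{GITLAB_URL}/api/v4/features/bulk_imports_batched_import_export",
    headers=HEADERS,
    data={"value": "false"},
    timeout=10,
).raise_for_status()

# Pause direct transfer at the instance level while investigating
# (assumed application-settings attribute).
requests.put(
    f"{GITLAB_URL}/api/v4/application/settings",
    headers=HEADERS,
    data={"bulk_import_enabled": "false"},
    timeout=10,
).raise_for_status()
```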

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. No
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. No
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Yes. The bulk_imports_batched_import_export feature flag was enabled on GitLab.com; it was introduced in gitlab-org/gitlab!124434 (merged) (issue link: gitlab-org/gitlab#391224 (closed)). Note that this change was made several months prior to the incident.

What went well?

  • Many team members jumped in to help, and we were able to quickly pull in the necessary domain expertise from the DBRE and Import groups.
  • A Google doc was used to organize notes while GitLab.com was down.
  • Customers and the e-group were kept informed with regular status updates.
