Incident Review: High database load caused slow and unresponsive merge requests and CI pipelines
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External Customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Merge requests and CI pipelines were delayed or unresponsive on GitLab.com
- How many customers were affected?
  - All SaaS customers
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - All customers were impacted, but there were only 8 customer reports (possibly due to overlap with the US Thanksgiving week)
What were the root causes?
- A SELECT query on the primary database was inefficient under load: the query generated an `IN` clause containing 85K ids (a sketch of this pattern, and a batched alternative, is shown below).
- This led to saturation of both the database and the Sidekiq connection pool, as the worker tried to process all the jobs on the Sidekiq queue.
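The snippet below is a minimal, illustrative sketch of that pattern, not the incident's actual code: the `Record` model, the id count, and the in-memory SQLite database are stand-ins chosen so the example runs standalone under a Rails-style ActiveRecord setup. It contrasts a single unbounded `IN` clause with a batched alternative that keeps each statement small.

```ruby
# Illustrative sketch only (not GitLab's real code). An in-memory SQLite DB stands in
# for the primary PostgreSQL database so the example can run standalone.
require "active_record"

ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")
ActiveRecord::Base.connection.create_table(:records) do |t|
  t.integer :value
end

class Record < ActiveRecord::Base; end

# Stand-in for the ~85K ids the worker collected before querying.
ids = (1..85_000).to_a

# Anti-pattern: one relation whose IN clause carries every id at once.
# `to_sql` renders the statement without executing it; in production this ran against
# the primary and became very expensive under load.
unbatched = Record.where(id: ids)
puts "single statement carries #{ids.size} ids (#{unbatched.to_sql.bytesize} bytes of SQL)"

# Bounded alternative: slice the id list so each SELECT stays small and predictable.
ids.each_slice(500) do |batch|
  Record.where(id: batch).load # at most 500 ids per statement
end
```

On the real schema a join against the source table (or another keyset/batched approach) may be preferable to materializing the ids at all; the point of the sketch is only that the size of any single statement has to be bounded.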
Incident Response Analysis
- How was the incident detected?
  - The EOC was paged for Sidekiq queue saturation: `SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard` for the `urgent-cpu-bound` shard
- How could detection time be improved?
  - If we could identify the worker and associated query and automatically notify the related team to investigate (e.g. in Slack).
- How was the root cause diagnosed?
  - `Security::ProcessScanResultPolicyWorker` was disabled via feature flag
- How could time to diagnosis be improved?
  - The pg_stat_statements entries in Elasticsearch were incomplete due to the size of the query, which made it harder to figure out which query this was. We have a runbook on how to debug slow queries, but it depends heavily on Elasticsearch; it should be updated to include all the things @msmiley did during the incident.
  - We have a runbook on how to disable Sidekiq workers via feature flags that needs a clearer name and updates so it works in all situations (how to disable a worker that has a namespace, and how to translate a worker to the right feature flag). A rough sketch of the kill-switch pattern is shown after this list.
  - If Sidekiq had an owning team (from the point of view of both how the code works and how the infrastructure works), we could have reached out to them.
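As a rough, hypothetical sketch of the kill-switch pattern the runbook should describe (the class name, flag name, and the `Feature` stub below are stand-ins, not GitLab's actual implementation or the exact flag used in this incident): guarding `perform` with a feature flag lets a worker be turned into a no-op at runtime, without a deploy.

```ruby
require "sidekiq"

# Stand-in for GitLab's Feature API (Feature.enabled? / Feature.disable), defined
# here only so the sketch is self-contained and runnable.
module Feature
  @flags = Hash.new(true)

  def self.enabled?(name)
    @flags[name]
  end

  def self.disable(name)
    @flags[name] = false
  end

  def self.enable(name)
    @flags[name] = true
  end
end

class ExampleScanResultWorker
  include Sidekiq::Worker

  FEATURE_FLAG = :run_example_scan_result_worker # hypothetical flag name

  def perform(project_id)
    # Kill switch: when the flag is off, jobs still dequeue and finish quickly,
    # but the expensive database work is skipped, relieving pressure on the primary.
    return unless Feature.enabled?(FEATURE_FLAG)

    # ... the expensive query / processing would go here ...
  end
end

# During an incident, an engineer flips the flag (via console or ChatOps) instead of deploying:
Feature.disable(:run_example_scan_result_worker)
ExampleScanResultWorker.new.perform(42) # returns immediately, no database work
```

The runbook gap noted above is precisely the mapping step this sketch glosses over: going from a (possibly namespaced) worker class to the flag that actually gates it.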
- How did we reach the point where we knew how to mitigate the impact?
  - When the query's inability to handle the added load from the enterprise customer was identified as the root cause of the saturation.
- How could time to mitigation be improved?
  - Finding available database and backend maintainers to help us merge the MR to amend the query.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - We had incidents where the same worker was involved, but the root cause was slightly different.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - The code had been in place for a while, but the incident arose because the query had not been tested at scale.
What went well?
- We disabled the worker via a feature flag without needing to patch the code, which mitigated the system degradation.
- We identified exactly which query (through pg_stat_statements) was leading to the database and Sidekiq saturation; see the sketch at the end of this review. This was partly luck, in that we had an SME who could figure all of this out, as the logs were incomplete.
- We identified the customer who was adding the increased load through correlation_ids, and were able to contact them to pause their operations.
- This gave us time to 1) process the large number of jobs (850K) still in the queue in a separate shard, and 2) let the Govern team work on a fix for the query.
- ...
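Related to the pg_stat_statements point above, the sketch below shows one way to look at the heaviest statements directly on the database host when the log-shipped copies of the query text are truncated. It is illustrative only: the connection parameters are placeholders, the `pg` gem and the pg_stat_statements extension are assumed, and the `total_exec_time` / `mean_exec_time` column names assume PostgreSQL 13+ (older versions call these `total_time` / `mean_time`).

```ruby
# Illustrative only: inspect pg_stat_statements directly when log-shipped copies of the
# query text are truncated. Connection parameters are placeholders; run against a
# replica or with read-only credentials.
require "pg"

conn = PG.connect(host: "db-replica.example.internal", dbname: "gitlabhq_production", user: "readonly")

rows = conn.exec(<<~SQL)
  SELECT left(query, 120)                   AS query_snippet,
         calls,
         round(total_exec_time::numeric, 1) AS total_ms,
         round(mean_exec_time::numeric, 1)  AS mean_ms
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10
SQL

rows.each do |r|
  puts format("%10s calls | %12s total ms | %s", r["calls"], r["total_ms"], r["query_snippet"])
end
```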