Incident Review: High database load caused slow and unresponsive merge requests and CI pipelines
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External Customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Merge requests and CI pipelines were delayed or unresponsive on GitLab.com
- How many customers were affected?
  - All SaaS customers
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - All customers were impacted, but there were only 8 customer reports (possibly due to overlap with the US Thanksgiving week)
What were the root causes?
- A SELECT query on the primary database was inefficient under load: the query generated an `IN` clause containing 85K ids (a sketch of this pattern, and a batched alternative, is shown below).
- This led to saturation of both the database and the Sidekiq connection pool, as the worker tried to process all the jobs on the Sidekiq queue.
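The snippet below is a minimal, illustrative sketch of that pattern, not the incident's actual code: the `Record` model, the id count, and the in-memory SQLite database are stand-ins chosen so the example runs standalone under a Rails-style ActiveRecord setup. It contrasts a single unbounded `IN` clause with a batched alternative that keeps each statement small.

```ruby
# Illustrative sketch only (not GitLab's real code). An in-memory SQLite DB stands in
# for the primary PostgreSQL database so the example can run standalone.
require "active_record"

ActiveRecord::Base.establish_connection(adapter: "sqlite3", database: ":memory:")
ActiveRecord::Base.connection.create_table(:records) do |t|
  t.integer :value
end

class Record < ActiveRecord::Base; end

# Stand-in for the ~85K ids the worker collected before querying.
ids = (1..85_000).to_a

# Anti-pattern: one relation whose IN clause carries every id at once.
# `to_sql` renders the statement without executing it; in production this ran against
# the primary and became very expensive under load.
unbatched = Record.where(id: ids)
puts "single statement carries #{ids.size} ids (#{unbatched.to_sql.bytesize} bytes of SQL)"

# Bounded alternative: slice the id list so each SELECT stays small and predictable.
ids.each_slice(500) do |batch|
  Record.where(id: batch).load # at most 500 ids per statement
end
```

On the real schema a join against the source table (or another keyset/batched approach) may be preferable to materializing the ids at all; the point of the sketch is only that the size of any single statement has to be bounded.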
Incident Response Analysis
- How was the incident detected?
  - The EOC was paged for Sidekiq queue saturation: `SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard` for the `urgent-cpu-bound` shard
- How could detection time be improved?
  - If we could identify the worker and associated query and automatically notify the related team to investigate (e.g. in Slack).
- How was the root cause diagnosed?
  - `Security::ProcessScanResultPolicyWorker` was disabled via feature flag
- How could time to diagnosis be improved?
  - The pg_stat_statements entries in Elasticsearch were incomplete due to the size of the query, which made it harder to figure out which query this was. We have a runbook on how to debug slow queries, but it depends heavily on Elasticsearch; it should be updated to include all the things @msmiley did during the incident.
  - We have a runbook on how to disable Sidekiq workers via feature flags that needs a clearer name and updates so it works in all situations (how to disable a worker that has a namespace, and how to translate a worker to the right feature flag). A rough sketch of the kill-switch pattern is shown after this list.
  - If Sidekiq had an owning team (from the point of view of both how the code works and how the infrastructure works), we could have reached out to them.
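As a rough, hypothetical sketch of the kill-switch pattern the runbook should describe (the class name, flag name, and the `Feature` stub below are stand-ins, not GitLab's actual implementation or the exact flag used in this incident): guarding `perform` with a feature flag lets a worker be turned into a no-op at runtime, without a deploy.

```ruby
require "sidekiq"

# Stand-in for GitLab's Feature API (Feature.enabled? / Feature.disable), defined
# here only so the sketch is self-contained and runnable.
module Feature
  @flags = Hash.new(true)

  def self.enabled?(name)
    @flags[name]
  end

  def self.disable(name)
    @flags[name] = false
  end

  def self.enable(name)
    @flags[name] = true
  end
end

class ExampleScanResultWorker
  include Sidekiq::Worker

  FEATURE_FLAG = :run_example_scan_result_worker # hypothetical flag name

  def perform(project_id)
    # Kill switch: when the flag is off, jobs still dequeue and finish quickly,
    # but the expensive database work is skipped, relieving pressure on the primary.
    return unless Feature.enabled?(FEATURE_FLAG)

    # ... the expensive query / processing would go here ...
  end
end

# During an incident, an engineer flips the flag (via console or ChatOps) instead of deploying:
Feature.disable(:run_example_scan_result_worker)
ExampleScanResultWorker.new.perform(42) # returns immediately, no database work
```

The runbook gap noted above is precisely the mapping step this sketch glosses over: going from a (possibly namespaced) worker class to the flag that actually gates it.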
- How did we reach the point where we knew how to mitigate the impact?
  - When the query's inability to handle the added load from the enterprise customer was identified as the root cause of the saturation.
- How could time to mitigation be improved?
  - Finding available database and backend maintainers to help us merge the MR to amend the query.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - We had incidents where the same worker was involved, but the root cause was slightly different.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - The code had been in place for a while, but the incident arose because the query had not been tested at scale.
What went well?
- We disabled the worker via a feature flag without needing to patch the code, which mitigated the system degradation.
- We identified exactly which query (through pg_stat_statements) was leading to the database and Sidekiq saturation; see the sketch at the end of this review. This was partly luck, in that we had an SME who could figure all of this out, as the logs were incomplete.
- We identified the customer who was adding the increased load through correlation_ids, and were able to contact them to pause their operations.
- This gave us time to 1) process the large number of jobs (850K) still in the queue in a separate shard, and 2) let the Govern team work on a fix for the query.
- ...
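Related to the pg_stat_statements point above, the sketch below shows one way to look at the heaviest statements directly on the database host when the log-shipped copies of the query text are truncated. It is illustrative only: the connection parameters are placeholders, the `pg` gem and the pg_stat_statements extension are assumed, and the `total_exec_time` / `mean_exec_time` column names assume PostgreSQL 13+ (older versions call these `total_time` / `mean_time`).

```ruby
# Illustrative only: inspect pg_stat_statements directly when log-shipped copies of the
# query text are truncated. Connection parameters are placeholders; run against a
# replica or with read-only credentials.
require "pg"

conn = PG.connect(host: "db-replica.example.internal", dbname: "gitlabhq_production", user: "readonly")

rows = conn.exec(<<~SQL)
  SELECT left(query, 120)                   AS query_snippet,
         calls,
         round(total_exec_time::numeric, 1) AS total_ms,
         round(mean_exec_time::numeric, 1)  AS mean_ms
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10
SQL

rows.each do |r|
  puts format("%10s calls | %12s total ms | %s", r["calls"], r["total_ms"], r["query_snippet"])
end
```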