Incident review for #17754 - 2024-03-25: StoreSecurityReportsWorker saturates pgbouncer pool, causing delays in UI
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @-mention the EOC, IMOC, and other parties who were involved and whom we would like to include in a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved; in the agenda, put a link to this review issue. The meeting should primarily discuss what is already documented in this issue and any questions that arise from it.
  - Ensure that the meeting is recorded; when complete, upload the recording to GitLab Unfiltered.
Customer Impact
**Who was impacted by this incident?** (i.e. external customers, internal customers)
- All customers whose UI actions and jobs depend on background processing
- Customers using vulnerability management features
**What was the customer experience during the incident?** (i.e. preventing them from doing X, incorrect display of Y, ...)
- Initially, all background jobs were affected by long queues. We had reports of CI jobs taking a long time to start.
- Once the broken worker was found and quarantined, the effect was limited to customers using that worker; other customers were no longer affected.
**How many customers were affected?**
- A large number of Ultimate subscribers.
**If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?**
- N/A
What were the root causes?
We run `StoreSecurityReportsWorker` for each pipeline on the default branch that has security reports. In that worker, within a single transaction, we create records in several tables. One of those tables is `vulnerability_identifiers`, and the query that creates records for that table is an UPSERT query.
There were too many pipelines for a single project, which caused too many `StoreSecurityReportsWorker` jobs to run in parallel, each trying to UPSERT the same records. That caused lock contention and long transaction times, which created backpressure on the PgBouncer pool and locked the entire Sidekiq workload out of running database queries.
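For illustration, here is a minimal sketch of the kind of UPSERT that can produce this contention, assuming the usual PostgreSQL `INSERT ... ON CONFLICT` shape; the column list and conflict target are illustrative assumptions, not the exact query the worker runs:

```sql
-- Hypothetical sketch of the vulnerability_identifiers UPSERT.
-- Column names and the conflict target are assumptions for illustration.
INSERT INTO vulnerability_identifiers
  (project_id, fingerprint, external_type, external_id, name, created_at, updated_at)
VALUES
  (42, '\x1234abcd', 'cve', 'CVE-2024-0001', 'CVE-2024-0001', now(), now())
ON CONFLICT (project_id, fingerprint)
DO UPDATE SET updated_at = EXCLUDED.updated_at;
```

When many transactions UPSERT the same `(project_id, fingerprint)` rows, each later transaction blocks on the row lock held by the earlier one until it commits. With long transactions, the parallel jobs serialize while holding their pooled connections open, which is how the lock contention translated into PgBouncer pool saturation.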
Incident Response Analysis
**How was the incident detected?**
- PagerDuty alerts about queue lengths in Sidekiq
**How could detection time be improved?**
- N/A
**How was the root cause diagnosed?**
- First identified that PgBouncer was saturated 👉 #17754 (comment 1830083588)
- Then identified that all the canceled transactions came from a specific worker 👉 #17754 (comment 1830122015)
**How could time to diagnosis be improved?**
- N/A
**How did we reach the point where we knew how to mitigate the impact?**
- Once we saw that the transactions getting canceled were all from a single worker, we deferred that worker's jobs.
**How could time to mitigation be improved?**
- We were unsure whether it was OK to defer the jobs and had to wait for the product team to give us the go-ahead. Since this was right after summer, most people were on PTO, so it took a while to find someone.
Post Incident Analysis
**Did we have other events in the past with the same root cause?**
- Not for this exact worker, but we've had Sidekiq saturation in the past: #17504 (closed), #17158 (closed), #17030 (closed), #17692 (closed), #17294 (closed), #17282 (closed)
**Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?**
- ...
**Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.**
- No. This was a pre-existing issue that had been there for 5 years.
What went well?
- We had good tooling to mitigate the incident, and a very clear runbook.
- Good ownership from the product engineering team to find the issue, and focus on short-term/long-term fixes.