# Incident Review: The sidekiq_queueing SLI of the sidekiq service on shard urgent-cpu-bound has an apdex violating SLO
## Key Information
| Metric | Value |
|---|---|
| Customers Affected | All users on gitlab.com using features that rely on Sidekiq in some capacity |
| Requests Affected | Sidekiq jobs, especially those on the urgent shards, saw increased processing delays |
| Incident Severity | severity1 |
| Start Time | 02:25 UTC |
| End Time | 10:44 UTC |
| Total Duration | 8 hours 19 minutes |
| Link to Incident Issue | #18489 (closed) |
## Summary
Sidekiq jobs could not complete as quickly as usual because the PgBouncer cluster serving Sidekiq database queries was saturated. This led to noticeable delays in operations that rely on background jobs, such as email delivery and AI chat responses.
## Details
- Root cause is as yet unclear.
- A summary of the evidence points to the establishment of a positive feedback loop between thread contention, exclusive DB leases taken by `FlushCounterIncrementsWorker`, and the creation of idle transactions, leading to saturation of the database's PgBouncer pool. Additionally, job de-duplication combined with long worker run times and a short lease time leads to worker rescheduling, which causes further DB contention.
- Slow Redis client connections, such as those in `FlushCounterIncrementsWorker` and in Markdown rendering, could also have exacerbated this feedback loop.
- We know that `FlushCounterIncrementsWorker` contributed to PgBouncer connection saturation, but we weren't able to correlate this increase with any particular change in code or configuration.
  - More details here: #18489 (comment 2088341818) (internal link)
- As a corrective action, we have updated `FlushCounterIncrementsWorker` so that it no longer acquires an exclusive lease for every database update to project statistics. A hedged sketch of the before/after pattern is shown after this list.
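To make the lease pattern above concrete, here is a minimal, hypothetical Sidekiq sketch. It is not GitLab's actual `FlushCounterIncrementsWorker`; the worker names, Redis key names, lease TTL, and `ProjectStatistics` schema are all assumptions. The first worker takes an application-side exclusive lease for every update and reschedules itself when the lease is already held, which is the behaviour that can compound into the rescheduling and connection-pressure loop described above. The second worker drops the lease and applies the buffered increment in a single atomic `UPDATE`, relying on Postgres row-level locking instead.

```ruby
# Minimal, hypothetical sketch only -- NOT GitLab's actual
# FlushCounterIncrementsWorker. Worker names, Redis keys, the lease TTL and
# the ProjectStatistics schema are illustrative assumptions. Establishing the
# ActiveRecord database connection is omitted for brevity.
require 'sidekiq'
require 'redis'
require 'active_record'

REDIS = Redis.new # assumed shared Redis connection

class ProjectStatistics < ActiveRecord::Base
  # Assumes an existing `project_statistics` table with integer counter columns.
end

# "Before" pattern: one exclusive lease per database update. Jobs that fail to
# obtain the lease reschedule themselves; combined with job de-duplication,
# long run times and a short lease TTL, this feeds the rescheduling and
# contention loop described above.
class LeasePerUpdateFlushWorker
  include Sidekiq::Worker

  LEASE_TTL = 10 # seconds -- assumed value

  def perform(project_id, attribute)
    lease_key = "flush_lease:#{project_id}:#{attribute}"
    got_lease = REDIS.set(lease_key, 1, nx: true, ex: LEASE_TTL)

    unless got_lease
      # Lease held by another job: try again later, adding queue pressure.
      self.class.perform_in(LEASE_TTL, project_id, attribute)
      return
    end

    apply_pending_increment(project_id, attribute)
  ensure
    REDIS.del(lease_key) if got_lease
  end

  private

  def apply_pending_increment(project_id, attribute)
    # `attribute` is assumed to be an allow-listed column name, never user input.
    pending = REDIS.getset("pending:#{project_id}:#{attribute}", 0).to_i
    return if pending.zero?

    ProjectStatistics.where(project_id: project_id)
                     .update_all(["#{attribute} = #{attribute} + ?", pending])
  end
end

# "After" pattern, conceptually matching the corrective action: no
# application-side lease. The buffered increment is applied in one atomic
# UPDATE, and Postgres row-level locking serializes concurrent writers.
class AtomicFlushWorker
  include Sidekiq::Worker

  def perform(project_id, attribute)
    # `attribute` is assumed to be an allow-listed column name, never user input.
    pending = REDIS.getset("pending:#{project_id}:#{attribute}", 0).to_i
    return if pending.zero?

    ProjectStatistics.where(project_id: project_id)
                     .update_all(["#{attribute} = #{attribute} + ?", pending])
  end
end
```

The trade-off sketched here is that per-row atomic updates push contention down to short-lived row locks in Postgres, rather than long-held application leases that keep workers rescheduling and connections idling in the PgBouncer pool.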
## Outcomes/Corrective Actions
## Learning Opportunities
### What went well?
- We had many people voluntarily jump in to help investigate as it became clearer that the root cause was not cut and dried.
- It was also helpful to have people not directly involved in the incident prompting for severity upgrades as the EOC was engrossed in investigation.
### What was difficult?
- It was difficult to pinpoint the cause and/or separate cause and symptoms.
- The first alert for this incident triggered just after a significant CR involving the database was completed.
- Human error while investigating this incident led to a separate sitewide outage. This caused many people to confuse the two incidents and their timelines.
- Multiple exploratory mitigation actions being run in parallel made it unclear what was ultimately responsible for recovery.
  - The deferral of `FlushCounterIncrementsWorker` jobs took place at the same time as a rollback deployment. Even now it's not entirely clear which action finally took load off the database.
- From an IM perspective, I found this one hard to manage because there were many potential signals, and it was not clear which were just noise (or which were symptoms rather than causes). This led to many possible mitigations being proposed, but it was not clear which were more promising than others. For example, some people were in favor of rolling back earlier, while others were in favor of attempting smaller, more surgical changes to see which ones might stick.
## Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service-owning team, they should assign someone from that team as the DRI.
### For the person opening the Incident Review
- Set the title to `Incident Review: (Incident issue name)`
- Assign a `Service::*` label, most likely matching the one on the incident issue (a scripted way to set labels and the assignee is sketched after this list)
- Set a `Severity::*` label which matches the incident
- In the `Key Information` section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
- Announce the incident review in the incident channel on Slack:

  :mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
  If you have any review feedback please add it to <ISSUE_LINK>.
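If it helps, the label and assignee steps above can also be done programmatically. Below is a minimal, hypothetical sketch against the GitLab REST API's edit-issue endpoint (`PUT /projects/:id/issues/:issue_iid`); the project path, issue IID, label names, and assignee ID are all placeholders, and reading the token from a `GITLAB_TOKEN` environment variable is an assumption rather than part of this process.

```ruby
# Hypothetical helper for the checklist above: set labels and the DRI on the
# incident review issue via the GitLab REST API. All values are placeholders.
require 'net/http'
require 'json'
require 'uri'

gitlab_api = 'https://gitlab.com/api/v4'
project    = URI.encode_www_form_component('group/project') # review issue's project path
issue_iid  = 12345                                          # the incident review issue IID

uri = URI("#{gitlab_api}/projects/#{project}/issues/#{issue_iid}")
request = Net::HTTP::Put.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_TOKEN') # personal access token with api scope
request['Content-Type']  = 'application/json'
request.body = {
  add_labels:   'Service::Sidekiq,Severity::1', # placeholder labels matching the incident
  assignee_ids: [1234]                          # placeholder user ID of the DRI
}.to_json

response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
puts "#{response.code} #{response.message}"
```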
### For the assigned DRI
- Fill in the remaining fields in the `Key Information` section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing `Customers Affected` or `Requests Affected`, link those metrics in those fields
- Write a few short sentences in the Summary section summarizing what happened (TL;DR)
- Use the description section to write a few paragraphs explaining what happened
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Add any appropriate labels based on the incident issue and discussions
- Once discussion wraps up in the comments, summarize any takeaways in the Details section
- Close the review before the due date