# Incident Review: The sidekiq_queueing SLI of the sidekiq service on shard urgent-cpu-bound has an apdex violating SLO
## Key Information
| Metric | Value |
|---|---|
| Customers Affected | All users on gitlab.com using features that rely on Sidekiq in some capacity |
| Requests Affected | Sidekiq jobs, especially those on the urgent shards, saw increased processing delays |
| Incident Severity | severity1 |
| Start Time | 02:25 UTC |
| End Time | 10:44 UTC |
| Total Duration | 8 hours 19 minutes |
| Link to Incident Issue | #18489 (closed) |
## Summary
Sidekiq jobs could not complete as quickly as usual because the PgBouncer cluster serving Sidekiq database queries was saturated. This led to noticeable delays in operations that rely on background jobs, such as email delivery and AI chat responses.
## Details
- Root cause is as yet unclear.
- A summary of the evidence points to the establishment of a positive feedback loop between thread contention, exclusive DB leases taken by `FlushCounterIncrementsWorker`, and the creation of idle transactions, leading to saturation of the database's PgBouncer pool. Additionally, job de-duplication combined with long worker run times and a short lease time leads to worker rescheduling, which causes further DB contention.
- Slow Redis client connections, such as those in `FlushCounterIncrementsWorker` and in Markdown rendering, could also have exacerbated this feedback loop.
- We know that `FlushCounterIncrementsWorker` contributed to PgBouncer connection saturation, but we weren't able to correlate this increase with any particular change in code or configuration.
  - More details here: #18489 (comment 2088341818) (internal link)
- As a corrective action, we have updated `FlushCounterIncrementsWorker` so that it no longer acquires an exclusive lease for every database update to project statistics. A hedged sketch of the before/after pattern is shown after this list.
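To make the lease pattern above concrete, here is a minimal, hypothetical Sidekiq sketch. It is not GitLab's actual `FlushCounterIncrementsWorker`; the worker names, Redis key names, lease TTL, and `ProjectStatistics` schema are all assumptions. The first worker takes an application-side exclusive lease for every update and reschedules itself when the lease is already held, which is the behaviour that can compound into the rescheduling and connection-pressure loop described above. The second worker drops the lease and applies the buffered increment in a single atomic `UPDATE`, relying on Postgres row-level locking instead.

```ruby
# Minimal, hypothetical sketch only -- NOT GitLab's actual
# FlushCounterIncrementsWorker. Worker names, Redis keys, the lease TTL and
# the ProjectStatistics schema are illustrative assumptions. Establishing the
# ActiveRecord database connection is omitted for brevity.
require 'sidekiq'
require 'redis'
require 'active_record'

REDIS = Redis.new # assumed shared Redis connection

class ProjectStatistics < ActiveRecord::Base
  # Assumes an existing `project_statistics` table with integer counter columns.
end

# "Before" pattern: one exclusive lease per database update. Jobs that fail to
# obtain the lease reschedule themselves; combined with job de-duplication,
# long run times and a short lease TTL, this feeds the rescheduling and
# contention loop described above.
class LeasePerUpdateFlushWorker
  include Sidekiq::Worker

  LEASE_TTL = 10 # seconds -- assumed value

  def perform(project_id, attribute)
    lease_key = "flush_lease:#{project_id}:#{attribute}"
    got_lease = REDIS.set(lease_key, 1, nx: true, ex: LEASE_TTL)

    unless got_lease
      # Lease held by another job: try again later, adding queue pressure.
      self.class.perform_in(LEASE_TTL, project_id, attribute)
      return
    end

    apply_pending_increment(project_id, attribute)
  ensure
    REDIS.del(lease_key) if got_lease
  end

  private

  def apply_pending_increment(project_id, attribute)
    # `attribute` is assumed to be an allow-listed column name, never user input.
    pending = REDIS.getset("pending:#{project_id}:#{attribute}", 0).to_i
    return if pending.zero?

    ProjectStatistics.where(project_id: project_id)
                     .update_all(["#{attribute} = #{attribute} + ?", pending])
  end
end

# "After" pattern, conceptually matching the corrective action: no
# application-side lease. The buffered increment is applied in one atomic
# UPDATE, and Postgres row-level locking serializes concurrent writers.
class AtomicFlushWorker
  include Sidekiq::Worker

  def perform(project_id, attribute)
    # `attribute` is assumed to be an allow-listed column name, never user input.
    pending = REDIS.getset("pending:#{project_id}:#{attribute}", 0).to_i
    return if pending.zero?

    ProjectStatistics.where(project_id: project_id)
                     .update_all(["#{attribute} = #{attribute} + ?", pending])
  end
end
```

The trade-off sketched here is that per-row atomic updates push contention down to short-lived row locks in Postgres, rather than long-held application leases that keep workers rescheduling and connections idling in the PgBouncer pool.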
## Outcomes/Corrective Actions
## Learning Opportunities
### What went well?
- We had many people voluntarily jump in to help investigate as it became clearer that the root cause was not cut and dried.
- It was also helpful to have people not directly involved in the incident prompting for severity upgrades as the EOC was engrossed in investigation.
### What was difficult?
- It was difficult to pinpoint the cause and/or separate cause and symptoms.
- The first alert for this incident triggered just after a significant CR involving the database was completed.
- Human error while investigating this incident led to a separate sitewide outage. This caused many people to confuse the two incidents and their timelines.
- Multiple exploratory mitigation actions being run in parallel made it unclear what was ultimately responsible for recovery.
  - The deferral of `FlushCounterIncrementsWorker` jobs took place at the same time as a rollback deployment. Even now it's not entirely clear which action finally took load off the database.
- From an IM perspective, I found this one hard to manage because there were many potential signals, and it was not clear which were just noise (or which were symptoms rather than causes). This led to many possible mitigations being proposed, but it was not clear which were more promising than others. For example, some people were in favor of rolling back earlier, while others were in favor of attempting smaller, more surgical changes to see which ones might stick.
## Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service-owning team, they should assign someone from that team as the DRI.
### For the person opening the Incident Review
- Set the title to `Incident Review: (Incident issue name)`
- Assign a `Service::*` label, most likely matching the one on the incident issue (a scripted way to set labels and the assignee is sketched after this list)
- Set a `Severity::*` label which matches the incident
- In the `Key Information` section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
- Announce the incident review in the incident channel on Slack:

  :mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
  If you have any review feedback please add it to <ISSUE_LINK>.
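If it helps, the label and assignee steps above can also be done programmatically. Below is a minimal, hypothetical sketch against the GitLab REST API's edit-issue endpoint (`PUT /projects/:id/issues/:issue_iid`); the project path, issue IID, label names, and assignee ID are all placeholders, and reading the token from a `GITLAB_TOKEN` environment variable is an assumption rather than part of this process.

```ruby
# Hypothetical helper for the checklist above: set labels and the DRI on the
# incident review issue via the GitLab REST API. All values are placeholders.
require 'net/http'
require 'json'
require 'uri'

gitlab_api = 'https://gitlab.com/api/v4'
project    = URI.encode_www_form_component('group/project') # review issue's project path
issue_iid  = 12345                                          # the incident review issue IID

uri = URI("#{gitlab_api}/projects/#{project}/issues/#{issue_iid}")
request = Net::HTTP::Put.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_TOKEN') # personal access token with api scope
request['Content-Type']  = 'application/json'
request.body = {
  add_labels:   'Service::Sidekiq,Severity::1', # placeholder labels matching the incident
  assignee_ids: [1234]                          # placeholder user ID of the DRI
}.to_json

response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(request) }
puts "#{response.code} #{response.message}"
```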
### For the assigned DRI
- Fill in the remaining fields in the `Key Information` section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing `Customers Affected` or `Requests Affected`, link those metrics in those fields
- Write a few short sentences in the Summary section summarizing what happened (TL;DR)
- Use the description section to write a few paragraphs explaining what happened
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Add any appropriate labels based on the incident issue and discussions
- Once discussion wraps up in the comments, summarize any takeaways in the Details section
- Close the review before the due date