# Incident Review: SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard
## Key Information
| Metric | Value |
|---|---|
| Customers Affected | At least 11,000 different root namespaces were affected. |
| Requests Affected | ~2,000,000 failed requests (12% of total traffic) during the initial 5-minute database stall. |
| Incident Severity | Severity 1 |
| Start Time | 2024-09-10 12:55 UTC |
| End Time | 2024-09-10 17:37 UTC |
| Total Duration | 4 hours and 42 minutes (282 minutes) |
| Link to Incident Issue | #18535 (closed) => #18538 (closed) |
## Summary
A disk IO stall caused the incident 2024-09-10: Increased errors on GitLab.com (#18535 - closed), starting around 12:55 UTC. During that incident, roughly 12% of requests failed because they were routed to the bad database node. We then noticed unresponsive Sidekiq catchall pods, which prevented us from moving CI traces from Redis into persistent storage. Once Redis filled up, our CI service became unresponsive for users.
After we scaled up redis-tracechunks, users could continue to use GitLab CI with degraded performance: the queues were still backlogged, so some operations took a while to be picked up. Users experienced GitLab being slow during this time.
## Details
We first noticed an increase in errors and determined the cause to be increased IO wait on one database node. [Google confirmed](#18536 (comment 21148621410)) this on their end. The problem self-resolved after about 5 minutes.
About an hour later, we noticed redis-tracechunks running out of memory and the catchall shard queue growing:
| Sidekiq queues growing | Memory utilization redis-tracechunks |
|---|---|
| (screenshot not captured) | (screenshot not captured) |
At around 14:00, redis-tracechunks was full, at which point we stopped accepting incoming traces from the runners.
We then scaled up the redis-tracechunks service, manually bumping up the machine type to 120G of memory as a temporary mitigation. This bought temporary headroom to continue the investigation, even though memory utilization continued to grow.
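To make the mitigation concrete, a quick way to gauge how much runway a resize buys is to compare `used_memory` against `maxmemory` as reported by Redis's `INFO memory` command. The helper below is a hypothetical sketch (not the exact tooling used during the incident); the byte values are stand-ins:

```ruby
# Hypothetical sketch: compute remaining memory headroom from the
# used_memory and maxmemory fields that `redis-cli info memory` reports.
def memory_headroom_pct(used_memory_bytes, maxmemory_bytes)
  return 0.0 if maxmemory_bytes.zero? # maxmemory=0 means "no limit" in Redis
  (1.0 - used_memory_bytes.to_f / maxmemory_bytes) * 100.0
end

# e.g. 100 GiB already used on a node resized to 120 GiB:
puts memory_headroom_pct(100 * 1024**3, 120 * 1024**3).round(1) # => 16.7
```

With utilization still growing, a headroom number like this mainly tells you how much investigation time the resize has bought.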
redis-tracechunks filled up because Sidekiq was not processing as many jobs as it should have been, so we were not flushing trace chunks to persistent storage. The obvious suspect was the earlier incident 2024-09-10: Increased errors on GitLab.com (#18535 - closed), in which one of the database disks had an IO stall. This appeared to have left some Sidekiq catchall pods in a bad state: they had picked up jobs but were not processing them. We decided to restart the affected pods, which improved the situation: Sidekiq workers started processing jobs from the queue, and new background jobs were processed as normal.
Within ~15 minutes of the pod restarts, the background job queue was empty and all stuck jobs had been processed.
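One rough way to detect the bad state described above (a hypothetical sketch, not GitLab's actual tooling) is to flag in-progress jobs whose run time exceeds a threshold. Sidekiq's `Sidekiq::Workers` API exposes a `run_at` epoch for each in-progress job that could feed such a check; the jobs below are stand-ins:

```ruby
# Hypothetical sketch: flag in-progress jobs that have been "running"
# longer than a threshold, which would have surfaced the catchall pods
# that picked up jobs but stopped making progress.
STUCK_THRESHOLD_SECONDS = 15 * 60

def stuck_jobs(in_progress, now:)
  in_progress.select { |job| now - job[:run_at] > STUCK_THRESHOLD_SECONDS }
end

jobs = [
  { jid: 'a1', run_at: 1_725_972_000 }, # picked up an hour before `now`
  { jid: 'b2', run_at: 1_725_975_500 }  # picked up 100 seconds before `now`
]
stuck_jobs(jobs, now: 1_725_975_600).each { |j| puts j[:jid] } # => prints "a1"
```

An alert on this kind of signal could have shortened the gap between the database stall and the discovery of the stuck pods.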
## Outcomes/Corrective Actions
- https://gitlab.com/gitlab-com/gl-infra/dbre/-/issues/238+: This host has gone bad before in #18269 (closed), so it seems prudent to replace it.
- Disable connection verification in `ConnectionP... (gitlab-org/gitlab#490211 - closed): we've seen this kind of incident repeat in #18565 (closed), with a similar IO stall. This fixes an application bug where we would try to reconnect to the bad host while trying to mark the node as unhealthy.
- Enable the `load_balancer_low_statement_timeout... (#18558 - closed): reduce the statement timeout on load balancer queries, so that clients spend less time checking likely-unhealthy database hosts.
- scalability#3827: consider finishing continuous profiling for Ruby workloads. This might have given us more insight into Sidekiq.
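The statement-timeout corrective action can be illustrated with a small sketch. Assuming a Postgres backend, wrapping load balancer health checks in `SET LOCAL statement_timeout` makes a query against a stalled host fail fast instead of tying up the client; the helper below is hypothetical, not GitLab's actual implementation:

```ruby
# Hypothetical sketch: build a host health-check query that gives up quickly
# on a stalled disk. `SET LOCAL` scopes the timeout to the enclosing
# transaction, so other queries on the connection keep their normal timeout.
def health_check_sql(timeout_ms: 100)
  ms = Integer(timeout_ms) # guard against interpolating arbitrary strings
  "SET LOCAL statement_timeout = #{ms}; SELECT 1;"
end

puts health_check_sql(timeout_ms: 100)
```

The trade-off is that an aggressive timeout can mark a merely slow host as unhealthy, so the value needs tuning against normal check latency.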
## Learning Opportunities
### What went well?
- Quick remediation of redis-tracechunks saturation, which allowed the CI workload to continue: users were able to see traces of running jobs again.
### What was difficult?
- It was difficult to troubleshoot why jobs weren't progressing: the workers seemed to be stuck, but no other resources appeared saturated. Better observability for Ruby workloads would have helped here.
- The initial database incident (#18535 (closed)) and the Sidekiq slowdown (#18538 (closed)) were not easy to correlate.
## Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
### For the person opening the Incident Review
- Set the title to `Incident Review: (Incident issue name)`
- Assign a `Service::*` label (most likely matching the one on the incident issue)
- Set a `Severity::*` label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
- Announce the incident review in the incident channel on Slack:
  > :mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
  > If you have any review feedback please add it to <ISSUE_LINK>.
### For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
- Write a few short sentences in the Summary section summarizing what happened (TL;DR)
- Use the description section to write a few paragraphs explaining what happened
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Add any appropriate labels based on the incident issue and discussions
- Once discussion wraps up in the comments, summarize any takeaways in the details section
- Close the review before the due date

