2021-07-13: The rails_redis_client SLI of the redis-sidekiq service (`main` stage) has an apdex violating SLO
Current Status
Issue is resolved.
For approximately 4.5 hours, there were delays in job processing (CI, async operations, anything involving Sidekiq) caused by excessive queuing that resulted from database connection pool saturation.
Once the connection pool reached saturation, all Sidekiq jobs were affected by the contention. Some jobs need to serially lease a database connection multiple times over the lifespan of the job's execution, and each lease request incurs a delay at the connection pooler (sometimes as long as 500-1000 ms). So a job class that politely leases (and releases) its database connection 10 times would incur a 5-10 second penalty while the shared connection pool was saturated. A sketch of this lease pattern follows.
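To make the lease pattern concrete, here is a minimal sketch assuming a plain Sidekiq + ActiveRecord setup (the worker and its helper method are hypothetical, not GitLab code):

```ruby
require 'sidekiq'
require 'active_record'

class PoliteLeaseWorker
  include Sidekiq::Worker

  def perform(record_ids)
    record_ids.each do |id|
      # Each checkout blocks until a pooled connection is free. While the
      # pool is saturated, this wait (500-1000 ms at the pooler), not the
      # query itself, dominates the job's runtime.
      ActiveRecord::Base.connection_pool.with_connection do |conn|
        conn.execute("SELECT 1 /* DB work for one record */")
      end
      fetch_external_data(id) # connection released during non-DB work
    end
  end

  private

  # Hypothetical stand-in for the IO-bound portion of the job.
  def fetch_external_data(id); end
end
```

Ten such lease/release cycles at 500-1000 ms apiece account for the 5-10 second penalty described above.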
See #5158 (comment 626640695) for the root cause
Timeline
View recent production deployment and configuration events / GCP events (internal only)
All times UTC.
2021-07-13

- 19:45 - Queuing starts on Sidekiq, causing job delays.
- 20:04 - @nnelson declares incident in Slack.

2021-07-14

- 00:10 - Issue is resolved.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Rapid Action: Investigate and address Sidekiq saturation gitlab-org&6361
- Sidekiq: Prevent `ArchiveTraceWorker` from holding client transactions open gitlab-org/gitlab#322142 (closed); a sketch of the underlying anti-pattern follows this list
- Zoom call to discuss corrective actions: https://youtu.be/ECpraohmcP8
- The `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` jobs were routed to the `catchall` fleet. These jobs are clearly CPU-bound and should have been routed to a `cpu-bound` shard rather than the `catchall` shard, which is intended to service non-CPU-bound Sidekiq workers. gitlab-org/gitlab#336285 (closed) (see the routing sketch under Lessons Learned)
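The `ArchiveTraceWorker` corrective action above targets a general anti-pattern: performing slow network IO inside an open database transaction, which pins a pooled connection (and its pgbouncer slot) for the full duration of the IO. Below is a minimal before/after sketch of such a fix, assuming plain ActiveRecord; `upload_to_object_storage`, `build.trace`, and `trace_archived` are hypothetical placeholders, not the actual GitLab change.

```ruby
require 'active_record'

# Anti-pattern: the slow object-storage upload runs inside the transaction,
# holding a DB connection and an open transaction for the whole upload.
def archive_trace_slow(build)
  ActiveRecord::Base.transaction do
    upload_to_object_storage(build.trace) # slow network IO inside the txn
    build.update!(trace_archived: true)
  end
end

# Fix: do the slow IO first, then open a short transaction covering only
# the database writes that actually need atomicity.
def archive_trace_fast(build)
  upload_to_object_storage(build.trace)   # no connection or txn held here
  ActiveRecord::Base.transaction do
    build.update!(trace_archived: true)
  end
end
```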
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
For a period of 265 minutes (between 2021-07-13 19:45 UTC and 2021-07-14 00:10 UTC), GitLab.com experienced a large drop in job throughput because of excessive Ruby thread contention in Sidekiq, caused by a large number of `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` Sidekiq jobs triggered by legitimate customer activity. This starved other jobs of the CPU cycles they needed to complete network-IO-related tasks, which expanded database transaction times and ultimately saturated the Sidekiq database connection pool, resulting in site-wide service degradation.
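This contention mechanism can be reproduced in isolation. The following is a minimal, self-contained sketch (an assumption-level illustration, not code from the incident) of how, on MRI/CRuby where the Global VM Lock serializes Ruby execution, CPU-bound threads delay an IO-bound thread's wake-ups in the same process, just as CPU-heavy jobs delayed other jobs' network IO here.

```ruby
require 'benchmark'

# Stand-in for network IO: `sleep` releases the GVL but must reacquire it
# after each wake, which is where the contention shows up.
def timed_io
  Benchmark.realtime { 10.times { sleep 0.01 } }
end

puts format('IO alone:          %.2fs', timed_io) # ~0.1s

# Busy loops that hold the GVL, mimicking CPU-bound Sidekiq jobs.
cpu_hogs = Array.new(4) { Thread.new { loop { 1 + 1 } } }

puts format('IO under CPU load: %.2fs', timed_io) # noticeably slower
cpu_hogs.each(&:kill)
```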
- Service(s) affected: ~"Service::CI Runners" ~"Service::Git" ~"Service::Puma" ~"Service::Redis" ~"Service::Pgbouncer" ~"Service::Web" ~"Service::Workhorse"
- Team attribution: ~"team::Scalability"
- Time to detection: 17 minutes (degradation began at 19:45 UTC; alerted at 20:02 UTC)
- Minutes downtime or degradation: 265 minutes
Metrics
Backlog of scheduled Sidekiq jobs:
Breakdown per queue:
Ruby thread contention on Sidekiq during the event:
Spike in `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` Sidekiq jobs:
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers and internal customers.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers reported being unable to merge branches into primary branches.
  - Customers reported pipeline jobs timing out or not starting.
- How many customers were affected?
  - 100%
What were the root causes?
- There were 116 legitimate API calls invoked by a single marquee customer with a very deep group hierarchy, resulting in 1.2 million `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` Sidekiq jobs being enqueued, which starved Sidekiq of the CPU needed to process other jobs.
Incident Response Analysis
- How was the incident detected?
  - The Site Reliability Engineer on-call was paged.
- How could detection time be improved?
  - Possibly by alerting on an elevated volume of log-query results for the specific API invocations in question.
- How was the root cause diagnosed?
  - Please refer to this comment: #5158 (comment 626640695)
- How could time to diagnosis be improved?
  - Possibly by alerting on an elevated volume of log-query results for the specific API invocations in question.
- How did we reach the point where we knew how to mitigate the impact?
  - A previous incident with similar consequences had led to the development of the mitigation tactic re-employed in this incident.
- How could time to mitigation be improved?
  - By more rapidly identifying the character of the incident.
- What went well?
  - Engineers with experience from the previous similar incident had documented mitigation tactics and were prepared to respond quickly.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Not the same root cause, but a systemic characteristic similar to the previous day's incident was in play.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Most such infrastructure issues have already been closed as complete.
  - A somewhat related issue is in progress: scalability#1152 (closed)
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No.
Lessons Learned
- Two queues (now sharded) were primarily responsible for the explosion in queue lengths:
  - `authorized_project_update:authorized_project_update_user_refresh_from_replica`
  - `authorized_project_update:authorized_project_update_user_refresh_with_low_urgency`
- All queues should be sharded and autoscaled independently. This isolates queues and prevents growth in any single queue from starving other queues of workers (see the sketch below).
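As a sketch of that isolation idea using stock Sidekiq primitives (the queue names, worker body, and concurrency figures are illustrative assumptions, not the production configuration):

```ruby
require 'sidekiq'

# Pin the CPU-bound worker to its own queue so it cannot crowd out work
# on the shared queues.
class UserRefreshFromReplicaWorker
  include Sidekiq::Worker
  sidekiq_options queue: :cpu_bound # illustrative queue name

  def perform(user_id)
    # CPU-heavy permission recalculation would go here.
  end
end

# Each shard is then a separate, independently autoscaled Sidekiq fleet
# listening only to its own queue(s), for example:
#
#   bundle exec sidekiq --queue cpu_bound --concurrency 4   # CPU-bound shard
#   bundle exec sidekiq --queue default --concurrency 25    # catchall shard
#
# A backlog in `cpu_bound` then scales only its own fleet and cannot starve
# `default` of workers.
```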