2021-07-13: The rails_redis_client SLI of the redis-sidekiq service (`main` stage) has an apdex violating SLO
Current Status
Issue is resolved.
For approximately 4.5 hours, there were delays in job processing (CI, async operations, anything involving Sidekiq) caused by excessive queuing that resulted from database connection pool saturation.
Once the connection pool reached saturation, all Sidekiq jobs were affected by the contention. Some jobs need to serially lease a database connection multiple times over the lifespan of the job's execution, and each lease request incurs a delay at the connection pooler (sometimes as long as 500-1000 ms). So a job class that politely leases (and releases) its database connection 10 times would incur a 5-10 second penalty while the shared connection pool was saturated. A sketch of this lease pattern follows.
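To make the lease pattern concrete, here is a minimal sketch assuming a plain Sidekiq + ActiveRecord setup (the worker and its helper method are hypothetical, not GitLab code):

```ruby
require 'sidekiq'
require 'active_record'

class PoliteLeaseWorker
  include Sidekiq::Worker

  def perform(record_ids)
    record_ids.each do |id|
      # Each checkout blocks until a pooled connection is free. While the
      # pool is saturated, this wait (500-1000 ms at the pooler), not the
      # query itself, dominates the job's runtime.
      ActiveRecord::Base.connection_pool.with_connection do |conn|
        conn.execute("SELECT 1 /* DB work for one record */")
      end
      fetch_external_data(id) # connection released during non-DB work
    end
  end

  private

  # Hypothetical stand-in for the IO-bound portion of the job.
  def fetch_external_data(id); end
end
```

Ten such lease/release cycles at 500-1000 ms apiece account for the 5-10 second penalty described above.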
See #5158 (comment 626640695) for the root cause
Timeline
View recent production deployment and configuration events / GCP events (internal only)
All times UTC.
2021-07-13

- 19:45 - Queuing starts on Sidekiq, causing job delays.
- 20:04 - @nnelson declares incident in Slack.

2021-07-14

- 00:10 - Issue is resolved.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Rapid Action: Investigate and address Sidekiq saturation gitlab-org&6361
- Sidekiq: Prevent `ArchiveTraceWorker` from holding client transactions open gitlab-org/gitlab#322142 (closed); a sketch of the underlying anti-pattern follows this list
- Zoom call to discuss corrective actions: https://youtu.be/ECpraohmcP8
- The `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` jobs were routed to the `catchall` fleet. These jobs are clearly CPU-bound and should have been routed to a `cpu-bound` shard rather than the `catchall` shard, which is intended to service non-CPU-bound Sidekiq workers. gitlab-org/gitlab#336285 (closed) (see the routing sketch under Lessons Learned)
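The `ArchiveTraceWorker` corrective action above targets a general anti-pattern: performing slow network IO inside an open database transaction, which pins a pooled connection (and its pgbouncer slot) for the full duration of the IO. Below is a minimal before/after sketch of such a fix, assuming plain ActiveRecord; `upload_to_object_storage`, `build.trace`, and `trace_archived` are hypothetical placeholders, not the actual GitLab change.

```ruby
require 'active_record'

# Anti-pattern: the slow object-storage upload runs inside the transaction,
# holding a DB connection and an open transaction for the whole upload.
def archive_trace_slow(build)
  ActiveRecord::Base.transaction do
    upload_to_object_storage(build.trace) # slow network IO inside the txn
    build.update!(trace_archived: true)
  end
end

# Fix: do the slow IO first, then open a short transaction covering only
# the database writes that actually need atomicity.
def archive_trace_fast(build)
  upload_to_object_storage(build.trace)   # no connection or txn held here
  ActiveRecord::Base.transaction do
    build.update!(trace_archived: true)
  end
end
```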
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
For a period of 265 minutes (between 2021-07-13 19:45 UTC and 2021-07-14 00:10 UTC), GitLab.com experienced a large drop in job throughput because of excessive Ruby thread contention in Sidekiq, caused by a large number of `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` Sidekiq jobs triggered by legitimate customer activity. This starved other jobs of the CPU cycles they needed to complete network-IO-related tasks, which expanded database transaction times and ultimately saturated the Sidekiq database connection pool, resulting in site-wide service degradation.
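This contention mechanism can be reproduced in isolation. The following is a minimal, self-contained sketch (an assumption-level illustration, not code from the incident) of how, on MRI/CRuby where the Global VM Lock serializes Ruby execution, CPU-bound threads delay an IO-bound thread's wake-ups in the same process, just as CPU-heavy jobs delayed other jobs' network IO here.

```ruby
require 'benchmark'

# Stand-in for network IO: `sleep` releases the GVL but must reacquire it
# after each wake, which is where the contention shows up.
def timed_io
  Benchmark.realtime { 10.times { sleep 0.01 } }
end

puts format('IO alone:          %.2fs', timed_io) # ~0.1s

# Busy loops that hold the GVL, mimicking CPU-bound Sidekiq jobs.
cpu_hogs = Array.new(4) { Thread.new { loop { 1 + 1 } } }

puts format('IO under CPU load: %.2fs', timed_io) # noticeably slower
cpu_hogs.each(&:kill)
```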
- Service(s) affected: ~"Service::CI Runners" ~"Service::Git" ~"Service::Puma" ~"Service::Redis" ~"Service::Pgbouncer" ~"Service::Web" ~"Service::Workhorse"
- Team attribution: ~"team::Scalability"
- Time to detection: 17 minutes (degradation began at 19:45 UTC; alerted at 20:02 UTC)
- Minutes downtime or degradation: 265 minutes
Metrics
Backlog of scheduled Sidekiq jobs:
Breakdown per queue:
Ruby thread contention on Sidekiq during the event:
Spike in `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` Sidekiq jobs:
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers and internal customers.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers reported being unable to merge branches into primary branches.
  - Customers reported pipeline jobs timing out or not starting.
- How many customers were affected?
  - 100%
What were the root causes?
- There were 116 legitimate API calls invoked by a single marquee customer with a very deep group hierarchy, resulting in 1.2 million `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker` Sidekiq jobs being enqueued, which starved Sidekiq of the CPU needed to process other jobs.
Incident Response Analysis
- How was the incident detected?
  - The Site Reliability Engineer on-call was paged.
- How could detection time be improved?
  - Possibly by alerting on an elevated volume of log-query results for the specific API invocations in question.
- How was the root cause diagnosed?
  - Please refer to this comment: #5158 (comment 626640695)
- How could time to diagnosis be improved?
  - Possibly by alerting on an elevated volume of log-query results for the specific API invocations in question.
- How did we reach the point where we knew how to mitigate the impact?
  - A previous incident with similar consequences had led to the development of the mitigation tactic re-employed in this incident.
- How could time to mitigation be improved?
  - By more rapidly identifying the character of the incident.
- What went well?
  - Engineers with experience from the previous similar incident had documented mitigation tactics and were prepared to respond quickly.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Not the same root cause, but a systemic characteristic similar to the previous day's incident was in play.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Most such infrastructure issues have already been closed as complete.
  - A somewhat related issue is in progress: scalability#1152 (closed)
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No.
Lessons Learned
- Two queues (now sharded) were primarily responsible for the explosion in queue lengths:
  - `authorized_project_update:authorized_project_update_user_refresh_from_replica`
  - `authorized_project_update:authorized_project_update_user_refresh_with_low_urgency`
- All queues should be sharded and autoscaled independently. This isolates queues and prevents growth in any single queue from starving other queues of workers (see the sketch below).
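As a sketch of that isolation idea using stock Sidekiq primitives (the queue names, worker body, and concurrency figures are illustrative assumptions, not the production configuration):

```ruby
require 'sidekiq'

# Pin the CPU-bound worker to its own queue so it cannot crowd out work
# on the shared queues.
class UserRefreshFromReplicaWorker
  include Sidekiq::Worker
  sidekiq_options queue: :cpu_bound # illustrative queue name

  def perform(user_id)
    # CPU-heavy permission recalculation would go here.
  end
end

# Each shard is then a separate, independently autoscaled Sidekiq fleet
# listening only to its own queue(s), for example:
#
#   bundle exec sidekiq --queue cpu_bound --concurrency 4   # CPU-bound shard
#   bundle exec sidekiq --queue default --concurrency 25    # catchall shard
#
# A backlog in `cpu_bound` then scales only its own fleet and cannot starve
# `default` of workers.
```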