Incident Review: Sidekiq queueing SLO violation on multiple shards
INC-5372: Sidekiq queueing SLO violation on multiple shards
Generated by Terri Chu on 30 Oct 2025 18:29. All timestamps are local to Etc/UTC
Key Information
| Metric | Value |
|---|---|
| Customers Affected | |
| Requests Affected | |
| Incident Severity | Severity 2 (High) |
| Impact Start Time | Thu, 30 Oct 2025 15:00:00 UTC |
| Impact End Time | Thu, 30 Oct 2025 17:55:00 UTC |
| Total Duration | 3 hours, 8 minutes |
| Link to Incident Issue | https://app.incident.io/gitlab/incidents/01K8TV0EHCKAA6R9N1YHZ8BX6R |
Summary
Problem: Multiple Sidekiq shards reached full capacity, leading to queueing SLO violations and a backlog of delayed jobs.
Impact: Sidekiq job queueing SLO violations affected multiple shards, causing delays in background job processing and degraded performance across several teams and services.
Causes: A spike in activity from the Security::SyncProjectPolicyWorker on the catch-all Sidekiq shard dominated processing for a period, crowding out other jobs and causing saturation and delays. The WebHooks::LogExecutionWorker also contributed a large share of long-running jobs during this incident window. A single group security policy change generated thousands of jobs, overwhelming the queue even with concurrency limits in place.
Response strategy: We temporarily increased the maximum pod limits for the low-urgency CPU-bound and catchall Sidekiq shards to clear the backlog. After deploying these changes, processing capacity improved and the job backlog cleared. The Apdex score for the catchall queue recovered from 1% to 95.5%. We will revert the pod increases once the queues have remained stable.
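To make the fan-out described under Causes concrete, the minimal Ruby sketch below (hypothetical method and model names, not the actual worker implementation) shows how a single group-level policy change can enqueue one child job per project. The concurrency limit caps how many of those jobs run at once, not how many get enqueued, so a large group still produces a sizeable backlog on the shard.

```ruby
# Illustrative sketch only; the lookups and class bodies are assumptions,
# not the real GitLab code.
module Security
  class SyncProjectPoliciesWorker # parent worker, run once per policy change
    include ApplicationWorker

    def perform(security_policy_id)
      policy = Security::Policy.find(security_policy_id) # hypothetical model lookup

      # One child job per project in the group. For a large group this enqueues
      # thousands of jobs in a single burst; the concurrency limit only bounds
      # how many execute at the same time.
      policy.group.all_projects.find_each do |project|
        Security::SyncProjectPolicyWorker.perform_async(project.id, security_policy_id)
      end
    end
  end
end
```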
What went well?
Use this section to highlight what went well during the incident. Capturing this helps understand informal processes and expertise, and enables undocumented knowledge to be shared.
Example:
- We quickly discovered a recently changed feature flag through the event log, which enabled fast mitigation of the impact, as well as pulling in the engineer involved to further diagnose.
- We escalated through dev escalations, which brought in Person X. They knew that Person Y had expertise with the component in question, which enabled faster diagnosis.
What was difficult?
Use this section to highlight opportunities for improvement discovered during the incident. Capturing this helps understand informal processes and expertise, and enables undocumented knowledge to be shared. If the improvement seems like a simple change, consider adding it as a corrective action above instead. Think about how to improve the response next time, and consider any patterns pointing to broader issues, like “key person risk.”
Example:
- The runbooks/playbooks for this service are out of date and did not contain the information necessary to troubleshoot the incident.
- The incident happened at a time when nobody with expertise on the service was available.
Investigation Details
Timeline
Incident Timeline
2025-10-30
15:00:00 Impact started at
Custom timestamp "Impact started at" occurred
15:15:33 Incident reported in triage by Prometheus Alertmanager alert
Prometheus Alertmanager alert reported the incident
Severity: None
Status: Triage
15:35:15 Incident accepted
Alex Hanselka shared an update
Severity: None → Severity 3 (Medium)
Status: Triage → Investigating
15:36:16 Message from Alex Hanselka
Alex Hanselka pinned their own message
The saturation for the HPA for both low-urgency-cpu-bound and urgent-other is at 100% and has been for a while. I think we should expand the shards a bit.
15:50:47 Message from Pravar Gauba
Pravar Gauba's message was pinned by Alex Hanselka
I'm suspecting this might be the reason here as well: since workers are being deferred, there are random queue spikes, which lead to SLO violations:
15:52:45 Message from Sarah Walker
Sarah Walker's message was pinned by Alex Hanselka
I agree that we should expand the shards
15:55:21 Message from Alex Hanselka
Alex Hanselka pinned their own message
I'm mildly apprehensive of this because of the pgbouncer alert. Though that cleared on its own
16:14:02 Severity upgraded from Severity 3 (Medium) → Severity 2 (High)
Alex Hanselka shared an update
Severity: Severity 3 (Medium) → Severity 2 (High)
The incident now affects multiple Sidekiq shards: 'urgent-other', 'low-urgency-cpu-bound', and 'catchall', all of which have reached 100% Horizontal Pod Autoscaler (HPA) saturation. This saturation is causing Sidekiq job queueing SLO violations, with performance degradation across several teams and services.
Active discussions are ongoing about increasing the max replica counts for the saturated shards, but current limits are already high (e.g., 750 for 'catchall').
The overall queue size is continuing to increase and request rates are dropping, raising concerns about further degradation.
No mitigation actions have been taken yet, but the team is considering expanding shard capacity to address the saturation.
16:22:10 Message from Sarah Walker
Sarah Walker's message was pinned by Alex Hanselka
https://dashboards.gitlab.net/goto/af2mlcd3x6v40a?orgId=1 - throughput per job
16:23:35 Message from Terri Chu
Terri Chu's message was pinned by Alex Hanselka
https://dashboards.gitlab.net/goto/ff2mlhanbdtz4a?orgId=1
16:25:11 Image posted by Sarah Walker
Sarah Walker posted an image to the channel
Security worker is there, just a very faint colour
16:37:52 Image posted by Terri Chu
Terri Chu posted an image to the channel
https://log.gprd.gitlab.net/app/r/s/ukpZA
16:39:16 Image posted by Alex Hanselka
Alex Hanselka posted an image to the channel
More security worker suspicions.
16:42:48 Update shared
Alex Hanselka shared an update
The investigation has identified that the 'Security::SyncProjectPolicyWorker' experienced a significant spike on the catch-all Sidekiq shard, which coincided with a sharp drop in throughput for other job types. This worker dominated processing for a period, crowding out other jobs and contributing to the backlog. Metrics and inflight job graphs support this as a major factor in the incident.
Review of Kibana logs and dashboards shows that 'WebHooks::LogExecutionWorker' is also responsible for a large share of long-running jobs during this incident window, but the main bottleneck appears to be linked to the surge in activity from the security worker. The team is seeking more information about the internal logic of this worker to better understand its impact.
Processing on the affected shards is beginning to recover, with queue lengths coming down and throughput improving over the last 10 minutes.
16:53:54 Message from Alex Hanselka
Alex Hanselka's message was pinned by Terri Chu
pod info dashboard: https://dashboards.gitlab.net/goto/df2mo6dfu6fwga?orgId=1
17:17:11 Update shared
Alex Hanselka shared an update
We are going to implement a temporary increase to the maximum pod limits for the low-urgency CPU-bound and catchall Sidekiq shards to help clear the job queue backlog. This change is tracked in MR #4890, and we plan to revert the pod limits once the backlog subsides. However, due to the Sidekiq queue backlog, the CI jobs required for the MR have not started.
The primary cause remains a spike in Security::SyncProjectPolicyWorker activity, which crowded out other jobs and led to saturation and delays.
Queue lengths and throughput are gradually improving, but some delays in job and pipeline processing persist for customers.
17:20:10 Image posted by Terri Chu
Terri Chu posted an image to the channel
https://dashboards.gitlab.net/goto/cf2mqhwxcxpmof?orgId=1
17:27:27 Message from Terri Chu
Terri Chu pinned their own message
The concurrency limit is set to 200 (https://gitlab.com/gitlab-org/gitlab/-/blob/1de138b0c5cd97a850b91ffc1e6b7cf5a4bbccd0/ee/app/workers/security/sync_project_policies_worker.rb#L13) for Security::SyncProjectPoliciesWorker. This worker calls Security::SyncProjectPolicyWorker, which has a concurrency limit of 200 (https://gitlab.com/gitlab-org/gitlab/-/blob/1de138b0c5cd97a850b91ffc1e6b7cf5a4bbccd0/ee/app/workers/security/sync_project_policy_worker.rb#L13)
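For context on the message above, a concurrency-limited worker declaration looks roughly like the sketch below; everything other than the `concurrency_limit` attribute is illustrative. Jobs beyond the limit are deferred and re-enqueued rather than dropped, which is consistent with the deferred-worker queue spikes noted earlier in the incident.

```ruby
# Rough sketch of a concurrency-limited Sidekiq worker; attributes other than
# concurrency_limit are illustrative, not copied from the real class.
module Security
  class SyncProjectPolicyWorker
    include ApplicationWorker

    feature_category :security_policy_management # illustrative

    # At most 200 jobs of this class run concurrently; excess jobs are deferred
    # and re-enqueued, which shows up as queueing delay on the shard.
    concurrency_limit -> { 200 }

    def perform(project_id, security_policy_id)
      # Policy sync logic elided.
    end
  end
end
```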
17:45:16 Image posted by Sarah Walker
Sarah Walker posted an image to the channel
Increasing the max replicas has definitely increased job capacity
17:52:00 Monitoring at
Custom timestamp "Monitoring at" occurred
17:52:00 Identified at
Custom timestamp "Identified at" occurred
17:52:00 Status changed from Investigating → Monitoring
Alex Hanselka shared an update
Status: Investigating → Monitoring
After deploying the merge request to increase max replicas, Sidekiq processing capacity has improved significantly.
The Apdex score for the catchall queue is recovering, rising quickly from 1% to 95.5%. No new paging alerts have been triggered, but we observed persistent saturation of the PgBouncer async primary pool, which has been at 100% for several hours. This may require further investigation, though it has not paged and does not appear to be directly impacting current recovery.
We continue to monitor for any anomalies, but the backlog is cleared and job processing rates are much improved.
17:55:00 Fixed at
Custom timestamp "Fixed at" occurred
18:23:27 Incident resolved and entered the post-incident flow
Alex Hanselka shared an update
Status: Monitoring → Documenting
The system remains stable and the incident is resolved.
Investigation Notes
Follow-ups
| Follow-up | Owner |
|---|---|
| Lower maxReplicas for the catchall and low-urgency-cpu-bound shards | Alex Hanselka |
| Investigate what SyncProjectPoliciesWorker does and the contributing factors | Unassigned |
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields.
- Create a few short sentences in the Summary section summarizing what happened (TL;DR).
- Link any corrective actions and describe any other actions or outcomes from the incident.
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section.
- If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.
- Close the review before the due date.
- Go back to the incident channel or page and close out the remaining post-incident tasks.