Incident Review: Sidekiq queueing SLO violation on multiple shards
INC-5372: Sidekiq queueing SLO violation on multiple shards
Generated by Terri Chu on 30 Oct 2025 18:29. All timestamps are local to Etc/UTC
Key Information
| Metric | Value |
|---|---|
| Customers Affected | |
| Requests Affected | |
| Incident Severity | Severity 2 (High) |
| Impact Start Time | Thu, 30 Oct 2025 15:00:00 UTC |
| Impact End Time | Thu, 30 Oct 2025 17:55:00 UTC |
| Total Duration | 3 hours, 8 minutes |
| Link to Incident Issue | https://app.incident.io/gitlab/incidents/01K8TV0EHCKAA6R9N1YHZ8BX6R |
Summary
Problem: Multiple Sidekiq shards reached full capacity, leading to queueing SLO violations and a backlog of delayed jobs.
Impact: Sidekiq job queueing SLO violations affected multiple shards, causing delays in background job processing and degraded performance across several teams and services.
Causes: A spike in activity from the Security::SyncProjectPolicyWorker on the catch-all Sidekiq shard dominated processing for a period, crowding out other jobs and causing saturation and delays. The WebHooks::LogExecutionWorker also contributed a large share of long-running jobs during this incident window. A single group security policy change generated thousands of jobs, overwhelming the queue even with concurrency limits in place.
Response strategy: We temporarily increased the maximum pod limits for the low-urgency CPU-bound and catchall Sidekiq shards to clear the backlog. After deploying these changes, processing capacity improved and the job backlog cleared. The Apdex score for the catchall queue recovered from 1% to 95.5%. We will revert the pod increases once the queues have remained stable.
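To make the fan-out described under Causes concrete, the minimal Ruby sketch below (hypothetical method and model names, not the actual worker implementation) shows how a single group-level policy change can enqueue one child job per project. The concurrency limit caps how many of those jobs run at once, not how many get enqueued, so a large group still produces a sizeable backlog on the shard.

```ruby
# Illustrative sketch only; the lookups and class bodies are assumptions,
# not the real GitLab code.
module Security
  class SyncProjectPoliciesWorker # parent worker, run once per policy change
    include ApplicationWorker

    def perform(security_policy_id)
      policy = Security::Policy.find(security_policy_id) # hypothetical model lookup

      # One child job per project in the group. For a large group this enqueues
      # thousands of jobs in a single burst; the concurrency limit only bounds
      # how many execute at the same time.
      policy.group.all_projects.find_each do |project|
        Security::SyncProjectPolicyWorker.perform_async(project.id, security_policy_id)
      end
    end
  end
end
```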
What went well?
Use this section to highlight what went well during the incident. Capturing this helps understand informal processes and expertise, and enables undocumented knowledge to be shared.
Example:
- We quickly discovered a recently changed feature flag through the event log, which enabled fast mitigation of the impact, as well as pulling in the engineer involved to further diagnose.
- We escalated through dev escalations, which brought in Person X. They knew that Person Y had expertise with the component in question, which enabled faster diagnosis.
What was difficult?
Use this section to highlight opportunities for improvement discovered during the incident. Capturing this helps understand informal processes and expertise, and enables undocumented knowledge to be shared. If the improvement seems like a simple change, consider adding it as a corrective action above instead. Think about how to improve the response next time, and consider any patterns pointing to broader issues, like “key person risk.”
Example:
- The runbooks/playbooks for this service are out of date and did not contain the information necessary to troubleshoot the incident.
- The incident happened at a time when nobody with expertise on the service was available.
Investigation Details
Timeline
Incident Timeline
2025-10-30
15:00:00 Impact started at
Custom timestamp "Impact started at" occurred
15:15:33 Incident reported in triage by Prometheus Alertmanager alert
Prometheus Alertmanager alert reported the incident
Severity: None
Status: Triage
15:35:15 Incident accepted
Alex Hanselka shared an update
Severity: None → Severity 3 (Medium)
Status: Triage → Investigating
15:36:16 Message from Alex Hanselka
Alex Hanselka pinned their own message
The saturation for the HPA for both low-urgency-cpu-bound and urgent-other is at 100% and has been for a while. I think we should expand the shards a bit.
15:50:47 Message from Pravar Gauba
Pravar Gauba's message was pinned by Alex Hanselka
I'm suspecting this might be the reason here as well: since workers are being deferred, there are random queue spikes, which lead to SLO violations:
15:52:45 Message from Sarah Walker
Sarah Walker's message was pinned by Alex Hanselka
I agree that we should expand the shards
15:55:21 Message from Alex Hanselka
Alex Hanselka pinned their own message
I'm mildly apprehensive of this because of the pgbouncer alert. Though that cleared on its own
16:14:02 Severity upgraded from Severity 3 (Medium) → Severity 2 (High)
Alex Hanselka shared an update
Severity: Severity 3 (Medium) → Severity 2 (High)
The incident now affects multiple Sidekiq shards: 'urgent-other', 'low-urgency-cpu-bound', and 'catchall', all of which have reached 100% Horizontal Pod Autoscaler (HPA) saturation. This saturation is causing Sidekiq job queueing SLO violations, with performance degradation across several teams and services.
Active discussions are ongoing about increasing the max replica counts for the saturated shards, but current limits are already high (e.g., 750 for 'catchall').
The overall queue size is continuing to increase and request rates are dropping, raising concerns about further degradation.
No mitigation actions have been taken yet, but the team is considering expanding shard capacity to address the saturation.
16:22:10 Message from Sarah Walker
Sarah Walker's message was pinned by Alex Hanselka
https://dashboards.gitlab.net/goto/af2mlcd3x6v40a?orgId=1 - throughput per job
16:23:35 Message from Terri Chu
Terri Chu's message was pinned by Alex Hanselka
https://dashboards.gitlab.net/goto/ff2mlhanbdtz4a?orgId=1
16:25:11 Image posted by Sarah Walker
Sarah Walker posted an image to the channel
Security worker is there, just a very faint colour
16:37:52 Image posted by Terri Chu
Terri Chu posted an image to the channel
https://log.gprd.gitlab.net/app/r/s/ukpZA
16:39:16 Image posted by Alex Hanselka
Alex Hanselka posted an image to the channel
More security worker suspicions.
16:42:48 Update shared
Alex Hanselka shared an update
The investigation has identified that the 'Security::SyncProjectPolicyWorker' experienced a significant spike on the catch-all Sidekiq shard, which coincided with a sharp drop in throughput for other job types. This worker dominated processing for a period, crowding out other jobs and contributing to the backlog. Metrics and inflight job graphs support this as a major factor in the incident.
Review of Kibana logs and dashboards shows that 'WebHooks::LogExecutionWorker' is also responsible for a large share of long-running jobs during this incident window, but the main bottleneck appears to be linked to the surge in activity from the security worker. The team is seeking more information about the internal logic of this worker to better understand its impact.
Processing on the affected shards is beginning to recover, with queue lengths coming down and throughput improving over the last 10 minutes.
16:53:54 Message from Alex Hanselka
Alex Hanselka's message was pinned by Terri Chu
pod info dashboard: https://dashboards.gitlab.net/goto/df2mo6dfu6fwga?orgId=1
17:17:11 Update shared
Alex Hanselka shared an update
We are going to implement a temporary increase to the maximum pod limits for the low-urgency CPU-bound and catchall Sidekiq shards to help clear the job queue backlog. This change is tracked in MR #4890, and we plan to revert the pod limits once the backlog subsides. However, due to the Sidekiq queue backlog, the CI jobs required for the MR have not started.
The primary cause remains a spike in Security::SyncProjectPolicyWorker activity, which crowded out other jobs and led to saturation and delays.
Queue lengths and throughput are gradually improving, but some delays in job and pipeline processing persist for customers.
17:20:10 Image posted by Terri Chu
Terri Chu posted an image to the channel
https://dashboards.gitlab.net/goto/cf2mqhwxcxpmof?orgId=1
17:27:27 Message from Terri Chu
Terri Chu pinned their own message
The concurrency limit is set to 200 (https://gitlab.com/gitlab-org/gitlab/-/blob/1de138b0c5cd97a850b91ffc1e6b7cf5a4bbccd0/ee/app/workers/security/sync_project_policies_worker.rb#L13) for Security::SyncProjectPoliciesWorker. This worker calls Security::SyncProjectPolicyWorker, which has a concurrency limit of 200 (https://gitlab.com/gitlab-org/gitlab/-/blob/1de138b0c5cd97a850b91ffc1e6b7cf5a4bbccd0/ee/app/workers/security/sync_project_policy_worker.rb#L13)
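For context on the message above, a concurrency-limited worker declaration looks roughly like the sketch below; everything other than the `concurrency_limit` attribute is illustrative. Jobs beyond the limit are deferred and re-enqueued rather than dropped, which is consistent with the deferred-worker queue spikes noted earlier in the incident.

```ruby
# Rough sketch of a concurrency-limited Sidekiq worker; attributes other than
# concurrency_limit are illustrative, not copied from the real class.
module Security
  class SyncProjectPolicyWorker
    include ApplicationWorker

    feature_category :security_policy_management # illustrative

    # At most 200 jobs of this class run concurrently; excess jobs are deferred
    # and re-enqueued, which shows up as queueing delay on the shard.
    concurrency_limit -> { 200 }

    def perform(project_id, security_policy_id)
      # Policy sync logic elided.
    end
  end
end
```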
17:45:16 Image posted by Sarah Walker
Sarah Walker posted an image to the channel
Increasing the max replicas has definitely increased job capacity
17:52:00 Monitoring at
Custom timestamp "Monitoring at" occurred
17:52:00 Identified at
Custom timestamp "Identified at" occurred
17:52:00 Status changed from Investigating → Monitoring
Alex Hanselka shared an update
Status: Investigating → Monitoring
After deploying the merge request to increase max replicas, Sidekiq processing capacity has improved significantly.
The Apdex score for the catchall queue is recovering, rising quickly from 1% to 95.5%. No new paging alerts have been triggered, but we observed persistent saturation of the PgBouncer async primary pool, which has been at 100% for several hours. This may require further investigation, though it has not paged and does not appear to be directly impacting current recovery.
We continue to monitor for any anomalies, but the backlog is cleared and job processing rates are much improved.
17:55:00 Fixed at
Custom timestamp "Fixed at" occurred
18:23:27 Incident resolved and entered the post-incident flow
Alex Hanselka shared an update
Status: Monitoring → Documenting
The system remains stable and the incident is resolved.
Investigation Notes
Follow-ups
| Follow-up | Owner |
|---|---|
| Lower maxReplicas for the catchall and low-urgency-cpu-bound shards | Alex Hanselka |
| Investigate what SyncProjectPoliciesWorker does and the contributing factors | Unassigned |
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields.
- Create a few short sentences in the Summary section summarizing what happened (TL;DR).
- Link any corrective actions and describe any other actions or outcomes from the incident.
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Once discussion wraps up in the comments, summarize any takeaways in the details section.
- If the incident timeline does not contain any sensitive information and this review can be made public, turn off the issue's confidential mode and link this review to the incident issue.
- Close the review before the due date.
- Go back to the incident channel or page and close out the remaining post-incident tasks.