2025-10-30: Sidekiq queueing SLO violation on multiple shards

Severity: 2 (High)

Problem: Multiple Sidekiq shards reached full capacity, leading to queueing SLO violations and a backlog of delayed jobs.

Impact: Sidekiq job queueing SLO violations affected multiple shards, causing delays in background job processing and degraded performance across several teams and services.

Causes: A spike in activity from Security::SyncProjectPolicyWorker on the catch-all Sidekiq shard dominated processing for a period, crowding out other jobs and causing saturation and delays. WebHooks::LogExecutionWorker also contributed a large share of long-running jobs during the incident window. A single group-level security policy change generated thousands of jobs, overwhelming the queue even with concurrency limits in place.
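To see why a concurrency limit alone did not prevent the saturation, here is a minimal back-of-envelope model in Ruby. All numbers (job count, concurrency cap, per-job duration) are hypothetical illustrations, not figures from this incident: a limit caps how many jobs run at once, so a large fan-out still occupies the shard for a long stretch while the backlog drains.

```ruby
# Hypothetical model: one policy change fans out one job per project.
# A concurrency limit bounds in-flight jobs but not the enqueue rate,
# so the backlog accumulates and drains only as fast as the cap allows.
def drain_time_seconds(jobs:, concurrency:, avg_job_seconds:)
  # Aggregate throughput across all permitted concurrent jobs.
  throughput = concurrency / avg_job_seconds.to_f
  jobs / throughput
end

# Example: 10,000 fanned-out jobs, capped at 25 concurrent, ~2 s each.
drain_time_seconds(jobs: 10_000, concurrency: 25, avg_job_seconds: 2)
# => 800.0  (over 13 minutes of sustained occupancy on the shard)
```

The takeaway is that concurrency limits protect downstream resources but shift the cost into queueing delay, which is exactly what the SLO measures.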

Response strategy: We temporarily increased the maximum pod limits for the low-urgency CPU-bound and catchall Sidekiq shards to clear the backlog. After these changes deployed, processing capacity improved and the job backlog cleared; the Apdex score for the catchall queue recovered from 1% to 95.5%. We will revert the pod increases once the queues have remained stable.
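The effect of raising pod limits can be sketched with the same kind of capacity arithmetic. The pod counts, thread counts, and job durations below are illustrative assumptions, not values from this incident:

```ruby
# Hypothetical model: each Sidekiq pod runs a fixed number of worker
# threads, so raising the max pod count raises aggregate throughput
# and shortens the time needed to clear a fixed backlog.
def backlog_clear_minutes(backlog:, pods:, threads_per_pod:, avg_job_seconds:)
  throughput_per_sec = (pods * threads_per_pod) / avg_job_seconds.to_f
  (backlog / throughput_per_sec) / 60.0
end

# 50,000 queued jobs, ~2 s each, 20 threads per pod:
backlog_clear_minutes(backlog: 50_000, pods: 10, threads_per_pod: 20, avg_job_seconds: 2)
# => ~8.3 minutes at 10 pods
backlog_clear_minutes(backlog: 50_000, pods: 25, threads_per_pod: 20, avg_job_seconds: 2)
# => ~3.3 minutes at 25 pods
```

Scaling out is a stopgap, which is why the increases are slated for reversion once queue latency stays within the SLO.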


This ticket was created to track INC-5372, by incident.io 🔥