2025-09-14: Sidekiq queueing SLI apdex SLO violation on catchall shard

Sidekiq queueing SLI apdex SLO violation on catchall shard (Severity 4 (Low))

Problem: A spike in Sidekiq job queueing times on the catchall shard breached the queueing SLO, so some jobs waited in the queue longer than their target queueing durations.

Impact: Sidekiq jobs on the catchall shard were delayed in the queue, and the queueing apdex dropped to 99.04% over a 6-hour window. As a result, some background jobs waited longer than their defined urgency targets.
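
As a rough illustration of what the apdex figure means, the sketch below assumes the queueing apdex is the share of jobs that started within their queueing target. The function name and the job counts are hypothetical, chosen only so the ratio works out to 99.04%; they are not measurements from this incident.

```python
def queueing_apdex(within_target: int, total: int) -> float:
    """Return the percentage of jobs that started within their queueing target."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= within_target <= total:
        raise ValueError("within_target must be between 0 and total")
    return 100.0 * within_target / total


if __name__ == "__main__":
    # Hypothetical counts: 9,904 of 10,000 jobs met their queueing target.
    print(f"{queueing_apdex(9_904, 10_000):.2f}%")
```

Under this assumption, even a small share of late jobs (under 1% in the example) is enough to pull the apdex below a strict SLO threshold.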

Causes: A scheduled job that deletes inactive resource access tokens enqueued a large number of Sidekiq jobs at once, which saturated the catchall shard's resources. Thousands of users were processed in a single burst, with no batching or throttling, which created a backlog and delayed other jobs on the shard.
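
One way to avoid this kind of burst is to split the work into fixed-size batches and stagger their start times. The sketch below is illustrative Python, not the actual worker code; `plan_batches`, `batch_size`, and `interval_seconds` are hypothetical names and values chosen for the example.

```python
from typing import Iterator, List, Sequence, Tuple


def plan_batches(
    user_ids: Sequence[int],
    batch_size: int = 100,
    interval_seconds: int = 30,
) -> Iterator[Tuple[int, List[int]]]:
    """Yield (delay_seconds, batch) pairs so batches start at staggered times.

    batch_size and interval_seconds are assumed values for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if interval_seconds < 0:
        raise ValueError("interval_seconds must not be negative")
    for index, start in enumerate(range(0, len(user_ids), batch_size)):
        # Batch 0 runs immediately; each later batch waits one more interval.
        yield index * interval_seconds, list(user_ids[start:start + batch_size])


if __name__ == "__main__":
    # 250 hypothetical user IDs become three batches of 100, 100, and 50.
    for delay, batch in plan_batches(list(range(1, 251))):
        print(f"enqueue {len(batch)} users after {delay}s")
```

In a Sidekiq worker, each delay would typically map to a scheduled enqueue such as `perform_in`, so the shard receives the work in small waves rather than all at once.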

Response strategy: We reviewed the relevant worker and dashboard metrics, confirmed that there were no recent code changes, and created a merge request to relax the queueing SLO threshold for the catchall shard to reduce alert sensitivity.


This ticket was created to track INC-3927, by incident.io 🔥