
2025-09-30: Sidekiq queueing SLI apdex SLO violation on catchall shard (apdex 66.79%)

Severity: 3 (Medium)

Problem: Low-urgency Sidekiq jobs were delayed by up to nearly 2 hours due to saturation caused by a flood of jobs. High-urgency jobs were unaffected.

Impact: Users on GitLab.com experienced delays in job execution. Support tickets confirmed user-facing service degradation. As of 2025-09-30 20:00 UTC, the backlog of pending jobs has completed, and user experience is back to normal.

Causes: A surge of 'Releases::CreateEvidenceWorker' jobs, triggered by a bulk import API request, saturated a database connection pool shared by low-urgency jobs. With the pool saturated, all low-urgency jobs collectively accumulated a standing backlog. Bulkheading prevented a wider scope of impact.

Response strategy: Consistent with the theme of improving contention management for shared resources, we plan to prevent the triggering job class from consuming so much of the connection pool that other job classes starve. A class-specific mitigation is being discussed in production-engineering#26700 (closed), and more generic reactive throttling is also planned for the near future.
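
To illustrate the idea of capping one job class's share of a shared pool, here is a minimal sketch. It is not the mitigation under discussion and does not use any Sidekiq or GitLab API; the pool size, the per-class limit, the `admit` function, and the `OtherWorker*` class names are all hypothetical and chosen only for illustration.

```python
from collections import Counter

POOL_SIZE = 10        # hypothetical number of shared database connections
PER_CLASS_LIMIT = 6   # hypothetical cap on connections held by one job class


def admit(queue, pool_size, per_class_limit=None):
    """Return the job classes that would hold a connection, in admission order.

    Jobs are scanned in FIFO order. When a per-class limit is set, a job whose
    class already holds that many connections is skipped and left waiting.
    """
    in_use = Counter()
    admitted = []
    for job_class in queue:
        if len(admitted) == pool_size:
            break  # pool is full
        if per_class_limit is not None and in_use[job_class] >= per_class_limit:
            continue  # class is at its cap; later jobs of other classes may pass
        in_use[job_class] += 1
        admitted.append(job_class)
    return admitted


# A flood of one class arrives ahead of a few unrelated low-urgency jobs.
queue = ["Releases::CreateEvidenceWorker"] * 50 + [
    "OtherWorkerA",
    "OtherWorkerB",
    "OtherWorkerC",
]

uncapped = Counter(admit(queue, POOL_SIZE))
capped = Counter(admit(queue, POOL_SIZE, PER_CLASS_LIMIT))

print("uncapped:", dict(uncapped))
print("capped:  ", dict(capped))
```

Without the cap, the flooding class takes every connection and the other jobs wait behind it. With the cap, the flooding class holds at most `PER_CLASS_LIMIT` connections and the remaining capacity is available to other classes.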


This ticket was created to track INC-4312, by incident.io 🔥