2026-05-12: Error rates violating SLO (#22103)


# Error rates violating SLO (Severity 1)

**Problem**: A major Redis Sidekiq cluster outage led to widespread elevated error rates, authentication failures, and job processing delays for all major GitLab.com services, including web, CI runners, and Git.

**Impact**: Many users experienced frequent 500 errors, CI pipeline failures, stuck or delayed pipelines, and unreliable Git access. Error rates peaked at 5% for CI runners and affected almost all services, leading to a high volume of support tickets. At its worst, the Sidekiq queue backlog reached over 12 million jobs, causing pipeline creation and job processing to stall.

**Causes**: Earlier mitigation for a previous incident led to audit event jobs being deferred, which then accumulated and consumed all available memory on the Redis Sidekiq cluster. This memory exhaustion resulted in kernel errors and caused multiple Redis nodes to fail, breaking the cluster quorum.

**Response strategy**: Engineers performed targeted restarts and forced resets on failed Redis nodes, then sequentially resized all Redis VMs to double their memory and CPU. As of 18:40 UTC, all three nodes are upgraded, error rates have dropped, and the system is recovering. The large Sidekiq queue backlog is now draining and being monitored. Long-term improvements are planned for the problematic audit event worker to prevent recurrence.

_This ticket was created to track_ [_INC-10096_](https://app.incident.io/gitlab/incidents/10096)_, by_ [_incident.io_](https://app.incident.io) 🔥
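
The quorum loss described under **Causes** can be illustrated with a minimal sketch. It assumes a simple strict-majority rule across the three nodes mentioned above; the summary does not state the cluster's actual quorum configuration, and `has_quorum` is a hypothetical helper, not part of Redis or GitLab tooling.

```python
def has_quorum(total_nodes: int, healthy_nodes: int) -> bool:
    """Return True if the healthy nodes form a strict majority.

    Assumption: quorum means more than half of all nodes. The real
    cluster's quorum setting is not given in the incident summary.
    """
    if total_nodes <= 0 or not 0 <= healthy_nodes <= total_nodes:
        raise ValueError("node counts out of range")
    return healthy_nodes >= total_nodes // 2 + 1


# Walk through every possible number of healthy nodes in a three-node cluster.
for healthy in range(3, -1, -1):
    print(f"healthy={healthy} quorum={has_quorum(3, healthy)}")
```

Under this assumption a three-node cluster tolerates one failed node but not two, which may explain why multiple simultaneous node failures were enough to break quorum.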
