Retrospective on Sidekiq July availability
Our Sidekiq availability score for July was 99.51%, far below our target availability of 99.95%. See https://gitlab.com/gitlab-com/gl-infra/reliability-reports/-/issues/183+
While most of the availability drop was due to the severity1 incidents in July, this retrospective issue continues to explore whether we should fold Sidekiq into our overall availability, and checks that every availability drop corresponds to an incident we can explain.
There were three major drops in availability in July and two smaller drops, all listed in the sections below. Questions considered in this retrospective:
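To put the gap between the two percentages in perspective, they can be converted into a time-equivalent figure. The sketch below is a rough illustration only: it assumes availability is a plain fraction of a 31-day month, which may not match how the score is actually calculated, and the function name `unavailable_minutes` is hypothetical.

```python
# Assumption: availability is treated as a simple fraction of wall-clock time
# over July (31 days). The real score may be computed differently.
MINUTES_IN_JULY = 31 * 24 * 60  # 44,640 minutes


def unavailable_minutes(availability_pct: float,
                        total_minutes: int = MINUTES_IN_JULY) -> float:
    """Return the time-equivalent unavailability for an availability percentage."""
    return total_minutes * (100.0 - availability_pct) / 100.0


target = unavailable_minutes(99.95)  # budget implied by the target
actual = unavailable_minutes(99.51)  # time-equivalent of the July score

print(f"budget at 99.95%:   {target:.1f} min")
print(f"observed at 99.51%: {actual:.1f} min")
print(f"budget overrun:     {actual / target:.1f}x")
```

Under that assumption, the target allows roughly 22 minutes of unavailability in the month, while the July score corresponds to roughly 219 minutes.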
- Do users feel impact from the frequent availability drops? Yes.
- Is 99.95% a reasonable goal for Sidekiq, or do we need to change the way we measure Sidekiq availability? I think we should keep this target.
- Did any of the High Severity User-impacting Issues for the month of July map to drops in availability? Yes, two of the three large drops mapped to the severity1 incidents. The third maps to production#16024 (closed), which we might consider making a higher severity.
- Should we fold Sidekiq into our overall availability for July? We should continue to wait until https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23793+ is completed.
🔥 Major drops in availability
Availability drop | Duration | Notes |
---|---|---|
July 7th, 16:00-19:00 | 3h | Due to 2023-07-07: Site-wide outage triggered by resta... (production#15997 - closed) severity1 |
July 11th, 16:00-16:30 | 30min | Due to 2023-07-11: Apdex SLO alert for sidekiq shard u... (production#16024 - closed): increased startup time due to external dependencies. severity3 |
July 15th, 00:50-01:23, 02:08-03:02 | 1h30min | Due to 2023-07-14: Intermittent apdex dips for web and... (production#16042 - closed) severity1 |
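As a quick cross-check of the Duration column, the windows in the table can be summed directly. This is a minimal sketch: the start and end times are copied from the table, the year is taken from the incident titles, and the helper `window_minutes` is a hypothetical name introduced for illustration.

```python
from datetime import datetime

# Windows copied from the table above; the year comes from the incident titles.
MAJOR_DROPS = [
    ("2023-07-07 16:00", "2023-07-07 19:00"),
    ("2023-07-11 16:00", "2023-07-11 16:30"),
    ("2023-07-15 00:50", "2023-07-15 01:23"),
    ("2023-07-15 02:08", "2023-07-15 03:02"),
]
FMT = "%Y-%m-%d %H:%M"


def window_minutes(start: str, end: str) -> int:
    """Length of one availability-drop window in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


per_window = [window_minutes(start, end) for start, end in MAJOR_DROPS]
total = sum(per_window)

print(per_window)  # minutes for each window, in table order
print(total)       # total minutes across the major drops
```

The two windows on July 15th add up to 87 minutes, which the table rounds to an hour and a half; the four windows together cover 297 minutes.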
🗒 Smaller drops in availability
- July 13th between 14:51 and 15:30: only around 15 minutes of impact. These drops didn't cause any alerts.
- July 18th - July 19th: Captured in 2023-07-17: shard_urgent_cpu_bound SLI of the s... (production#16050 - closed), where we resolved the issue by increasing the concurrency of the `urgent-cpu-bound` shard in gitlab-com/gl-infra/k8s-workloads/gitlab-com!2874 (merged)
📝 Alerts unrelated to availability drops
From July 28th to 31st there were 11 pages for the Sidekiq service due to hourly queue spikes; more information is in production#16096 (closed). No availability drops correspond to these alerts.
Edited by John Jarvis