2020-03-06: False alert about low Sidekiq SLO due to metrics changes
Summary
An application change renamed a Prometheus metric label, but our recording rules were not fully updated to accommodate the change. The resulting incomplete data was perceived as an SLO drop when no real drop had occurred.
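The failure mode described above can be sketched in a few lines. This is only an illustration of how a stale label matcher can make a ratio look worse than it is, not a reconstruction of the actual rules: the metric names (`jobs_good`, `jobs_total`), the label names (`queue`, `queue_name`), and all values are hypothetical, since this report does not name the label that changed.

```python
# Minimal model of a ratio-style recording rule. All names and numbers here
# are hypothetical; the incident record does not state the real ones.

def select(samples, name, **matchers):
    """Return values of samples with this name whose labels match every matcher."""
    return [
        value
        for (sample_name, labels, value) in samples
        if sample_name == name
        and all(labels.get(key) == want for key, want in matchers.items())
    ]


def slo_ratio(samples, good_matchers):
    """Good jobs (filtered by label matchers) divided by all jobs."""
    good = sum(select(samples, "jobs_good", **good_matchers))
    total = sum(select(samples, "jobs_total"))
    return good / total


# Before the application change: the label is called "queue".
BEFORE = [
    ("jobs_good", {"queue": "default"}, 99),
    ("jobs_total", {"queue": "default"}, 100),
]

# After the change: same job counts, but the label is now "queue_name".
AFTER = [
    ("jobs_good", {"queue_name": "default"}, 99),
    ("jobs_total", {"queue_name": "default"}, 100),
]

STALE_RULE = {"queue": "default"}       # still matches the old label
FIXED_RULE = {"queue_name": "default"}  # updated to the new label

print(slo_ratio(BEFORE, STALE_RULE))  # 0.99
print(slo_ratio(AFTER, STALE_RULE))   # 0.0, although the jobs themselves are fine
print(slo_ratio(AFTER, FIXED_RULE))   # 0.99
```

In this sketch the job counts never change; only the selector stops matching, which is why the computed ratio falls while processing stays healthy.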
Timeline
All times UTC.
2020-03-06
- 09:19 - We're alerted about low Sidekiq SLO
- 09:25 - Jobs seem to be processing fine from the looks of the admin dashboard
- 09:36 - We notice some Grafana dashboards stopped showing metrics
- 09:49 - We find that a Prometheus metric had its label changed as part of a recent application change
- 10:19 - The alert auto-resolves
- 10:28 - We merge a fix to update the metrics recording rules
- 10:30 - The SLO drops again after we roll out the new updates
- 11:35 - The SLO is now above threshold
Resources
- If the Situation Zoom room was used, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Edited by Ahmad Sherif