Lack of observability
Summary
More information will be added as we investigate the issue.
Queries in Thanos are failing and Grafana dashboards are unavailable.
Timeline
All times UTC.
2020-04-21
- 19:00:00 gitlab-com/runbooks!2135 (merged) changes the rules, which results in a circular dependency in Thanos
- 23:40:00 transaction durations start to go up
2020-04-22
- 02:15:00 request duration increases sharply and we start to hit the 5-minute timeout
- 08:59:00 gitlab-cookbooks/gitlab-prometheus!521 (merged) adds a monitor label to the rule server
- 09:02:00 gitlab-com/runbooks!2140 (diffs) adds a filter to record rules to skip the rule server
- 10:29:00 gitlab-com/runbooks!2142 (merged) improves routing
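To illustrate the failure mode in the timeline, the sketch below models rules as a dependency graph and looks for a cycle. It is only a rough analogy, not the tooling or the rule definitions involved in this incident: the rule names, the `find_cycle` helper, and the `rule_server_outputs` set are all hypothetical. The second half shows, under the same assumptions, how filtering out inputs produced by the rule server (loosely comparable to the filter added at 09:02) removes the cycle.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of rule names, or None.

    `deps` maps a rule name to the rules whose output it reads.
    Uses a depth-first search with three visit states.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {name: WHITE for name in deps}
    stack: List[str] = []

    def visit(name: str) -> Optional[List[str]]:
        state[name] = GREY
        stack.append(name)
        for dep in deps.get(name, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we found a cycle
                return stack[stack.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        stack.pop()
        state[name] = BLACK
        return None

    for name in list(deps):
        if state[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical rules: rule_b reads rule_a's output and rule_a reads rule_b's.
rules = {
    "rule_a": ["rule_b"],
    "rule_b": ["rule_a"],
    "rule_c": [],
}

# Assumption for illustration: rule_a's output is written by the rule server.
rule_server_outputs = {"rule_a"}

# Drop every input that comes from the rule server, then check again.
filtered = {
    name: [d for d in inputs if d not in rule_server_outputs]
    for name, inputs in rules.items()
}

print(find_cycle(rules))
print(find_cycle(filtered))
```

A check along these lines could in principle run before a rule change is merged, although whether it would have caught this particular change depends on details not covered here.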
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Edited by Michal Wasilewski