[Incident Review] 2020-04-22 Lack of Observability
Incident: production#1978 (closed)
Summary
Prometheus and Alertmanager were still operational, but Thanos was completely unavailable and, as a result, so was Grafana.
- no alerts were triggered:
- we could use an Elastic Watcher HTTP input to perform a simple check that Prometheus, Alertmanager, Thanos, and Grafana are operational (a sketch of such a watch follows this list). @bjk-gitlab do we already have any alerting for the "monitoring stack" that is external to it?
- what steps were taken to troubleshoot it? (how can we improve time to detection)
- we have this dashboard for the monitoring components: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2 . However, many of the charts are empty, and these components should be monitored by an external system anyway
- We have Thanos sending tracing info to Elastic APM
- The only logs available in Kibana are from Prometheus: https://log.gprd.gitlab.net/goto/db7eca8ec4cbd80b2ae1187715e62984
- what would have prevented it from happening?
- the change was reviewed; we simply missed the fact that it introduced a circular dependency
- Service(s) affected : ~"Service::Thanos"
- Team attribution : ~"team::Observability"
- Minutes downtime or degradation : ~600
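A minimal sketch of the external check suggested above, under a few assumptions: the host name is a placeholder (one watch per component), the Watcher HTTP input exposes the response status code as `_status_code`, and Prometheus, Alertmanager, and Thanos answer on `/-/healthy` while Grafana answers on `/api/health`. It is shown as YAML for readability; the actual watch body would be JSON submitted via `PUT _watcher/watch/<id>`.

```yaml
trigger:
  schedule:
    interval: 1m                          # run the check every minute
input:
  http:
    request:
      scheme: https
      host: thanos-query.ops.gitlab.net   # placeholder host for one component
      port: 443
      path: /-/healthy                    # Grafana would use /api/health instead
condition:
  compare:
    ctx.payload._status_code:
      not_eq: 200                         # anything other than 200 counts as unhealthy
actions:
  log_failure:
    logging:
      text: "Thanos query health check failed"
```

A real watch would use a notification action (Slack or a PagerDuty webhook) rather than logging, and would also need to treat connection errors, not just non-200 responses, as failures.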
Customer Impact
- Who was impacted by this incident? All employees attempting to use dashboards.gitlab.net and the infrastructure department's monitoring systems.
- What was the customer experience during the incident? None; customer-facing services were not affected.
- How many customers were affected? None
- If a precise customer impact number is unknown, what is the estimated potential impact? n/a
Incident Response Analysis
- How was the event detected? Grafana dashboards were timing out.
- How could detection time be improved? Monitoring of the Grafana system, or alerting on Thanos query latency, could have caught this sooner (a sketch of such an alert follows this list).
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
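A hedged sketch of the latency alert mentioned under detection time, to be evaluated by the Prometheus servers rather than by Thanos itself so that it keeps working while Thanos is degraded. The metric and label names (`http_request_duration_seconds_bucket`, `job="thanos-query"`, `handler="query"`), the 30s threshold, and the severity label are assumptions that would need to be checked against what our Thanos version exposes and against the SLO we settle on.

```yaml
groups:
  - name: thanos-query-latency.rules
    rules:
      - alert: ThanosQueryLatencyHigh
        # p99 duration of instant queries served by thanos-query over 5 minutes.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{job="thanos-query", handler="query"}[5m])
            )
          ) > 30
        for: 10m
        labels:
          severity: critical
        annotations:
          title: Thanos query p99 latency is above 30s
          description: Dashboards that rely on Thanos will be slow or time out.
```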
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Yes, &159.
- Was this incident triggered by a change (deployment of code or change to infrastructure; if yes, have you linked the issue which represents the change)? Yes, gitlab-com/runbooks!2135 (merged).
Timeline
All times UTC.
2020-04-21
- 19:00:00 gitlab-com/runbooks!2135 (merged) changes the rules, resulting in a circular dependency in Thanos
- 23:40:00 transaction durations start to go up
2020-04-22
- 02:15:00 request durations go through the roof and we start to hit the 5-minute timeout
- 08:59:00 gitlab-cookbooks/gitlab-prometheus!521 (merged) adds a monitor label to the rule server
- 09:02:00 gitlab-com/runbooks!2140 (diffs) adds a filter to recording rules to skip the rule server (see the sketch after the timeline)
- 10:29:00 gitlab-com/runbooks!2142 (merged) improves routing
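To illustrate the 08:59 and 09:02 fixes: once the rule server's output carries a distinguishing `monitor` label, recording rules can exclude it so they no longer consume their own results. This is only a sketch; the metric, the `monitor="rule"` label value, and the rule name are hypothetical, and the real filter is the one added in gitlab-com/runbooks!2140.

```yaml
groups:
  - name: aggregation-example.rules
    rules:
      # Hypothetical aggregation evaluated by the Thanos rule server. The
      # monitor!="rule" matcher drops series that originate from the rule
      # server itself, which is what breaks the circular dependency.
      - record: component:http_requests:rate5m
        expr: |
          sum by (environment, component) (
            rate(http_requests_total{monitor!="rule"}[5m])
          )
```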
5 Whys
Lessons Learned
Corrective Actions
- investigate and fix monitoring-related metrics which are unavailable in Grafana
- add an Alertmanager alert for Thanos latency (might just be a matter of defining an SLO threshold)
- there's already an epic: &159
- add Alertmanager alerts for the other components of the monitoring stack (a sketch of such an availability alert follows this list)
- same as above
- add Elastic watches for alerting when the monitoring stack is down
- same as above
- start using Jaeger/Elastic APM for tracing Prometheus/Alertmanager/Grafana
- support for Jaeger in Prometheus is coming in the next release
- send logs from Thanos, Alertmanager, and Grafana to Kibana
- they simply don't log much
- set up a staging environment for our monitoring stack so that we can test changes before they go to production (I think this would be overkill; we should focus on better monitoring and alerting instead)
- we should do it as part of the migration to Kubernetes; creating the staging environment will be easier when it is not done with Chef, and it will help the migration itself as well
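As a starting point for the stack-wide alerts listed above, a hedged sketch of a blanket availability rule evaluated by the Prometheus servers. The job name regex is an assumption and would have to match our actual scrape job names.

```yaml
groups:
  - name: monitoring-stack-availability.rules
    rules:
      - alert: MonitoringStackComponentDown
        # `up` is 0 whenever Prometheus fails to scrape a target; the job
        # regex keeps this focused on the monitoring stack itself.
        expr: up{job=~"thanos-query|thanos-store|thanos-rule|grafana|alertmanager"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          title: "Monitoring stack component {{ $labels.job }} on {{ $labels.instance }} is down"
```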
Guidelines