[Incident Review] 2020-04-22 Lack of Observability
Incident: production#1978 (closed)
Summary
Prometheus and Alertmanager were still operational, but Thanos was completely unavailable and, as a result, so was Grafana.
- no alerts were triggered:
- we could use an Elastic Watcher HTTP input to perform a simple check that Prometheus, Alertmanager, Thanos, and Grafana are operational (a sketch of such a watch follows this list). @bjk-gitlab do we already have any alerting for the "monitoring stack" that is external to it?
- what steps were taken to troubleshoot it? (how can we improve time to detection)
- we have this dashboard for the monitoring components: https://dashboards.gitlab.net/d/monitoring-main/monitoring-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2 . However, many of the charts are empty, and these components should be monitored by an external system anyway
- We have Thanos sending tracing info to Elastic APM
- The only logs available in Kibana are from Prometheus: https://log.gprd.gitlab.net/goto/db7eca8ec4cbd80b2ae1187715e62984
- what would have prevented it from happening?
- the change was reviewed; we simply missed the fact that it introduced a circular dependency
- Service(s) affected : ~"Service::Thanos"
- Team attribution : ~"team::Observability"
- Minutes downtime or degradation : ~600
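A minimal sketch of the external check suggested above, under a few assumptions: the host name is a placeholder (one watch per component), the Watcher HTTP input exposes the response status code as `_status_code`, and Prometheus, Alertmanager, and Thanos answer on `/-/healthy` while Grafana answers on `/api/health`. It is shown as YAML for readability; the actual watch body would be JSON submitted via `PUT _watcher/watch/<id>`.

```yaml
trigger:
  schedule:
    interval: 1m                          # run the check every minute
input:
  http:
    request:
      scheme: https
      host: thanos-query.ops.gitlab.net   # placeholder host for one component
      port: 443
      path: /-/healthy                    # Grafana would use /api/health instead
condition:
  compare:
    ctx.payload._status_code:
      not_eq: 200                         # anything other than 200 counts as unhealthy
actions:
  log_failure:
    logging:
      text: "Thanos query health check failed"
```

A real watch would use a notification action (Slack or a PagerDuty webhook) rather than logging, and would also need to treat connection errors, not just non-200 responses, as failures.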
Customer Impact
- Who was impacted by this incident? All employees attempting to use dashboards.gitlab.net and the infrastructure department's monitoring systems.
- What was the customer experience during the incident? None; customer-facing services were not affected.
- How many customers were affected? None
- If a precise customer impact number is unknown, what is the estimated potential impact? n/a
Incident Response Analysis
- How was the event detected? Grafana dashboards were timing out.
- How could detection time be improved? Monitoring of the Grafana system, or alerting on Thanos query latency, could have caught this sooner (a sketch of such an alert follows this list).
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
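A hedged sketch of the latency alert mentioned under detection time, to be evaluated by the Prometheus servers rather than by Thanos itself so that it keeps working while Thanos is degraded. The metric and label names (`http_request_duration_seconds_bucket`, `job="thanos-query"`, `handler="query"`), the 30s threshold, and the severity label are assumptions that would need to be checked against what our Thanos version exposes and against the SLO we settle on.

```yaml
groups:
  - name: thanos-query-latency.rules
    rules:
      - alert: ThanosQueryLatencyHigh
        # p99 duration of instant queries served by thanos-query over 5 minutes.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{job="thanos-query", handler="query"}[5m])
            )
          ) > 30
        for: 10m
        labels:
          severity: critical
        annotations:
          title: Thanos query p99 latency is above 30s
          description: Dashboards that rely on Thanos will be slow or time out.
```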
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Yes, &159.
- Was this incident triggered by a change (deployment of code or change to infrastructure; if yes, have you linked the issue which represents the change)? Yes, gitlab-com/runbooks!2135 (merged).
Timeline
All times UTC.
2020-04-21
- 19:00:00 gitlab-com/runbooks!2135 (merged) changes the rules, resulting in a circular dependency in Thanos
- 23:40:00 transaction durations start to go up
2020-04-22
- 02:15:00 request durations go through the roof and we start to hit the 5-minute timeout
- 08:59:00 gitlab-cookbooks/gitlab-prometheus!521 (merged) adds a monitor label to the rule server
- 09:02:00 gitlab-com/runbooks!2140 (diffs) adds a filter to recording rules to skip the rule server (see the sketch after the timeline)
- 10:29:00 gitlab-com/runbooks!2142 (merged) improves routing
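To illustrate the 08:59 and 09:02 fixes: once the rule server's output carries a distinguishing `monitor` label, recording rules can exclude it so they no longer consume their own results. This is only a sketch; the metric, the `monitor="rule"` label value, and the rule name are hypothetical, and the real filter is the one added in gitlab-com/runbooks!2140.

```yaml
groups:
  - name: aggregation-example.rules
    rules:
      # Hypothetical aggregation evaluated by the Thanos rule server. The
      # monitor!="rule" matcher drops series that originate from the rule
      # server itself, which is what breaks the circular dependency.
      - record: component:http_requests:rate5m
        expr: |
          sum by (environment, component) (
            rate(http_requests_total{monitor!="rule"}[5m])
          )
```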
5 Whys
Lessons Learned
Corrective Actions
- investigate and fix monitoring-related metrics which are unavailable in Grafana
- add an Alertmanager alert for Thanos latency (might just be a matter of defining an SLO threshold)
- there's already an epic: &159
- add Alertmanager alerts for the other components of the monitoring stack (a sketch of such an availability alert follows this list)
- same as above
- add Elastic watches for alerting when the monitoring stack is down
- same as above
- start using Jaeger/Elastic APM for tracing Prometheus/Alertmanager/Grafana
- support for Jaeger in Prometheus is coming in the next release
- send logs from Thanos, Alertmanager, and Grafana to Kibana
- they simply don't log much
- set up a staging environment for our monitoring stack so that we can test changes before they go to production (I think this would be overkill; we should focus on better monitoring and alerting instead)
- we should do it as part of the migration to Kubernetes; creating the staging environment will be easier when it is not done with Chef, and it will help the migration itself as well
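As a starting point for the stack-wide alerts listed above, a hedged sketch of a blanket availability rule evaluated by the Prometheus servers. The job name regex is an assumption and would have to match our actual scrape job names.

```yaml
groups:
  - name: monitoring-stack-availability.rules
    rules:
      - alert: MonitoringStackComponentDown
        # `up` is 0 whenever Prometheus fails to scrape a target; the job
        # regex keeps this focused on the monitoring stack itself.
        expr: up{job=~"thanos-query|thanos-store|thanos-rule|grafana|alertmanager"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          title: "Monitoring stack component {{ $labels.job }} on {{ $labels.instance }} is down"
```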
Guidelines