
[Incident Review] 2020-04-22 Lack of Observability

Incident: production#1978 (closed)

Summary

Prometheus and Alertmanager were still operational, but Thanos was completely unavailable, and as a result Grafana was unavailable as well.

  1. Service(s) affected : ~"Service::Thanos"
  2. Team attribution : ~"team::Observability"
  3. Minutes downtime or degradation : ~600

Customer Impact

  1. Who was impacted by this incident? All employees attempting to use dashboards.gitlab.net and the infrastructure department's monitoring systems.
  2. What was the customer experience during the incident? None
  3. How many customers were affected? None
  4. If a precise customer impact number is unknown, what is the estimated potential impact? n/a

Incident Response Analysis

  1. How was the event detected? Grafana dashboards were timing out.
  2. How could detection time be improved? Monitoring of the Grafana system or latency of Thanos queries could have alerted us (see the availability sketch after this list).
  3. How did we reach the point where we knew how to mitigate the impact?
  4. How could time to mitigation be improved?
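
As a reference for the detection improvement above, here is a minimal sketch of an availability alert on the dashboards.gitlab.net endpoint, assuming a blackbox exporter HTTP probe is already scraping it; the job and instance label values, the 5m window, and the severity are illustrative and not taken from our current configuration.

```yaml
# Sketch: page when the Grafana endpoint fails its HTTP probe.
# Assumes an existing blackbox exporter probe of dashboards.gitlab.net;
# label values and thresholds below are placeholders.
groups:
  - name: grafana-availability
    rules:
      - alert: GrafanaDashboardsUnreachable
        expr: probe_success{job="blackbox", instance="https://dashboards.gitlab.net/"} == 0
        for: 5m
        labels:
          severity: s2
        annotations:
          summary: "dashboards.gitlab.net has been failing its HTTP probe for 5 minutes"
```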

Post Incident Analysis

  1. How was the root cause diagnosed?
  2. How could time to diagnosis be improved?
  3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Yes, &159.
  4. Was this incident triggered by a change (deployment of code or change to infrastructure; if yes, have you linked the issue which represents the change)? Yes, gitlab-com/runbooks!2135 (merged).

Timeline

All times UTC.

2020-04-21

2020-04-22

5 Whys

Lessons Learned

Corrective Actions

  • investigate and fix monitoring-related metrics that are unavailable in Grafana
  • add an Alertmanager alert for Thanos latency (this may be a matter of defining an SLO threshold; see the latency sketch after this list)
    • there's already an epic: &159
  • add Alertmanager alerts for other components of the monitoring stack
    • same as above
  • add Elastic watches for alerting when the monitoring stack is down
    • same as above
  • start using Jaeger/Elastic APM for tracing Prometheus/Alertmanager/Grafana
    • support for Jaeger in Prometheus is coming in the next release
  • send logs from Thanos, Alertmanager, and Grafana to Kibana
    • these components simply don't log much
  • set up a staging environment for our monitoring stack so that we can test changes before they go to production (I think this would be overkill; we should focus on better monitoring and alerting instead)
    • we should do this as part of the migration to Kubernetes; creating the staging environment will be easier when it is not done with Chef, and it will help the migration itself as well
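
As a starting point for the Thanos latency alert mentioned above, here is a minimal sketch of a Prometheus rule. The histogram comes from Thanos Query's HTTP instrumentation, but the job and handler label values, the 10s p99 threshold, and the severity are placeholders until an SLO threshold is defined.

```yaml
# Sketch: alert when 99th percentile Thanos query_range latency stays high.
# The job/handler labels and the 10s threshold are assumptions, not an agreed SLO.
groups:
  - name: thanos-query-latency
    rules:
      - alert: ThanosQueryLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (
              rate(http_request_duration_seconds_bucket{job="thanos-query", handler="query_range"}[5m])
            )
          ) > 10
        for: 15m
        labels:
          severity: s3
        annotations:
          summary: "99th percentile Thanos query_range latency has been above 10s for 15 minutes"
```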

