Thanos and Prometheus not responding under load
Summary
High memory utilization on the primary gprd Prometheus server is causing it to crash, which leaves Grafana graphs unresponsive.
Service(s) affected : Grafana, Thanos, Prometheus
Team attribution : Reliability Engineering / Observability
Minutes downtime or degradation :
Timeline
2019-08-14
- 14:14 UTC - Grafana graphs stopped responding as a result of Prometheus crashes
- 14:45 UTC - An MR was submitted to resize the instances and address the memory usage: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/merge_requests/923
- ...
Edited by AnthonySandoval