Reliability and Debuggability for the monitoring stack
DRI: @mwasilewski-gitlab @steveazz
Status 2022-03-11
&530 (comment 871505738): This effort is stalled until a new DRI is assigned.
Previous status updates
2021-07-12 - ideation, addressing urgent problems
2021-08-09 - simplify Kubernetes config in tanka-deployments and take ownership of the repo
2021-08-23 - two parallel efforts:
- Reenable tracing for Thanos ( &568 (closed) )
- migrate Grafana to Kubernetes ( &146 (closed) )
2021-08-30 - Postgres training
2021-09-06 - troubleshooting Prometheus OOM kills in production: production#5466 (closed)
2021-09-13 - some more Prometheus troubleshooting
2021-09-20 - I spent most of the week working on Jaeger+ES (in the end we decided to use Stackdriver instead) and troubleshooting issues with Thanos in production https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14222
2021-09-27 - on-call shadowing
2021-10-04 - finishing on-call, recovering from on-call, rolling out osquery to production
2021-10-14 - &530 (comment 704775078)
2021-10-22 - &530 (comment 711332827)
2021-10-29 - &530 (comment 718587703)
2021-11-05 - &530 (comment 725308157)
2021-11-12 - &530 (comment 731973795)
2021-11-19 - &530 (comment 740085992)
2021-11-26 - &530 (comment 745086134)
2021-12-17 - &530 (comment 783787081)
Overview
We need to rebuild domain knowledge of the monitoring stack (Prometheus, Thanos, ...) in the infrastructure team and become familiar again with the existing setup. Once we have that, we should start doing regular maintenance as well as long-term planning (capacity planning, next steps, ...). We will also be able to troubleshoot problems with the monitoring infrastructure and help the dev teams with monitoring-related questions. Initially, this will probably happen mostly through maintenance tasks such as version upgrades. In the past, we had a lot of incidents related to Thanos, Grafana, and Prometheus, and not many people know how the setup works because it was built by people who have since left the company.
Mission statement
We need to pay off technical debt on our monitoring stack. We are mainly focusing on two things:
- reliability: the monitoring stack shouldn't break down constantly or get in the way of our engineers debugging a problem, and it should be able to self-heal and scale on demand.
- debuggability: when the monitoring stack does break, the person fixing it should have the right tools at hand: runbooks, monitoring of the stack itself, and tracing to understand bottlenecks (see the sketch below).
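To make "monitoring of the stack itself" concrete, here is a minimal sketch of a self-check that probes the `/-/healthy` and `/-/ready` endpoints exposed by Prometheus and Thanos components. The component names, ports, and URLs are placeholder assumptions, not our actual production addresses.

```python
# Minimal sketch of a blackbox-style self-check for the monitoring stack.
# Assumes placeholder endpoints; Prometheus and Thanos components expose
# /-/healthy and /-/ready HTTP endpoints.
import sys
import urllib.request

# Hypothetical targets; replace with the real service addresses.
TARGETS = {
    "prometheus": "http://localhost:9090",
    "thanos-query": "http://localhost:10902",
}


def probe(name: str, base_url: str) -> bool:
    """Return True if the component reports itself healthy and ready."""
    for path in ("/-/healthy", "/-/ready"):
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                if resp.status != 200:
                    print(f"{name}{path}: HTTP {resp.status}")
                    return False
        except OSError as err:  # covers URLError/HTTPError and socket errors
            print(f"{name}{path}: {err}")
            return False
    print(f"{name}: healthy and ready")
    return True


if __name__ == "__main__":
    results = [probe(name, url) for name, url in TARGETS.items()]
    sys.exit(0 if all(results) else 1)
```

A check like this could run from a cron job or a separate "meta-monitoring" Prometheus instance so that an outage of the main stack is still visible.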
Definition of done
Reference
- Good first issues
- Slack channel: #monitoring-reliability
- Roadmap