Reliability and Debuggability for the monitoring stack
DRI: @mwasilewski-gitlab @steveazz
Status 2022-03-11
&530 (comment 871505738): This effort is stalled until a new DRI is assigned.
Previous status updates
2021-07-12 - ideation, addressing urgent problems
2021-08-09 - simplify Kubernetes config in tanka-deployments and take ownership of the repo
2021-08-23 - two parallel efforts:
- Reenable tracing for Thanos ( &568 (closed) )
- migrate Grafana to Kubernetes ( &146 (closed) )
2021-08-30 - Postgres training
2021-09-06 - troubleshooting Prometheus OOM kills in production: production#5466 (closed)
2021-09-13 - some more Prometheus troubleshooting
2021-09-20 - I spent most of the week working on Jaeger+ES (in the end we decided to use Stackdriver instead) and troubleshooting issues with Thanos in production https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14222
2021-09-27 - on-call shadowing
2021-10-04 - finishing on-call, recovering from on-call, rolling out osquery to production
2021-10-14 - &530 (comment 704775078)
2021-10-22 - &530 (comment 711332827)
2021-10-29 - &530 (comment 718587703)
2021-11-05 - &530 (comment 725308157)
2021-11-12 - &530 (comment 731973795)
2021-11-19 - &530 (comment 740085992)
2021-11-26 - &530 (comment 745086134)
2021-12-17 - &530 (comment 783787081)
Overview
We need to rebuild domain knowledge of the monitoring stack (Prometheus, Thanos, ...) in the infrastructure team and become familiar again with the existing setup. Once we have that, we should start doing regular maintenance as well as long-term planning (capacity planning, next steps, ...). We will also be able to troubleshoot problems with the monitoring infrastructure and help the dev teams with monitoring-related questions. Initially, this will probably happen mostly through maintenance tasks such as version upgrades. In the past, we had a lot of incidents related to Thanos, Grafana, and Prometheus, and not many people know how the setup works because it was built by people who have since left the company.
Mission statement
We need to pay off technical debt on our monitoring stack. We are mainly focusing on two things:
- reliability: the monitoring stack shouldn't break down constantly or get in the way of our engineers debugging a problem, and it should be able to self-heal and scale on demand.
- debuggability: when the monitoring stack does break, the person fixing it should have the right tools at hand: runbooks, monitoring of the stack itself, and tracing to understand bottlenecks (see the sketch below).
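To make "monitoring of the stack itself" concrete, here is a minimal sketch of a self-check that probes the `/-/healthy` and `/-/ready` endpoints exposed by Prometheus and Thanos components. The component names, ports, and URLs are placeholder assumptions, not our actual production addresses.

```python
# Minimal sketch of a blackbox-style self-check for the monitoring stack.
# Assumes placeholder endpoints; Prometheus and Thanos components expose
# /-/healthy and /-/ready HTTP endpoints.
import sys
import urllib.request

# Hypothetical targets; replace with the real service addresses.
TARGETS = {
    "prometheus": "http://localhost:9090",
    "thanos-query": "http://localhost:10902",
}


def probe(name: str, base_url: str) -> bool:
    """Return True if the component reports itself healthy and ready."""
    for path in ("/-/healthy", "/-/ready"):
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                if resp.status != 200:
                    print(f"{name}{path}: HTTP {resp.status}")
                    return False
        except OSError as err:  # covers URLError/HTTPError and socket errors
            print(f"{name}{path}: {err}")
            return False
    print(f"{name}: healthy and ready")
    return True


if __name__ == "__main__":
    results = [probe(name, url) for name, url in TARGETS.items()]
    sys.exit(0 if all(results) else 1)
```

A check like this could run from a cron job or a separate "meta-monitoring" Prometheus instance so that an outage of the main stack is still visible.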
Definition of done
Reference
- Good first issues
- Slack channel: #monitoring-reliability
- Roadmap