  • Reliability and Debuggability for monitoring stack

  • Open Epic created by Michal Wasilewski

    DRI: @mwasilewski-gitlab @steveazz

    Status 2022-03-11

    &530 (comment 871505738) This effort is stalled until a new DRI is assigned.

    Previous status updates

    2021-07-12 - ideation, addressing urgent problems

    2021-08-09 - simplify Kubernetes config in tanka-deployments and take ownership of the repo

    2021-08-23 - two parallel efforts:

    2021-08-30 - Postgres training

    2021-09-06 - troubleshooting Prometheus OOM kills in production: production#5466 (closed)

    2021-09-13 - some more Prometheus troubleshooting

    2021-09-20 - I spent most of the week working on Jaeger+ES (in the end we decided to use Stackdriver instead) and troubleshooting issues with Thanos in production: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14222

    2021-09-27 - on-call shadowing

    2021-10-04 - finishing on-call, recovering from on-call, rolling out os-query to production

    2021-10-14 - &530 (comment 704775078)

    2021-10-22 - &530 (comment 711332827)

    2021-10-29 - &530 (comment 718587703)

    2021-11-05 - &530 (comment 725308157)

    2021-11-12 - &530 (comment 731973795)

    2021-11-19 - &530 (comment 740085992)

    2021-11-26 - &530 (comment 745086134)

    2021-12-17 - &530 (comment 783787081)

    Overview

    We need to rebuild domain knowledge (Prometheus, Thanos, ...) in the infrastructure team around the monitoring stack. We also need to become familiar again with the existing setup. Once we have that, we should start doing regular maintenance as well as long-term planning (capacity planning, next steps, ...). We will also be able to troubleshoot problems with the monitoring infrastructure and help the dev teams with monitoring-related questions. Initially, this will probably happen mostly through maintenance tasks such as version upgrades. In the past, we had a lot of incidents related to Thanos, Grafana, and Prometheus. Few people know how the setup works, since it was built by previous employees.

    Mission statement

    We need to pay off some technical debt on our monitoring stack. We are mainly focusing on two things:

    • reliability: The monitoring stack shouldn't break down constantly or get in the way of engineers debugging a problem. It should be able to self-heal and scale on demand.
    • debuggability: When the monitoring stack breaks, the person fixing it should have the tools at hand to fix the problem: runbooks, monitoring for the stack itself, and tracing to understand bottlenecks.

    Definition of done

    Reference

    Edited by John Jarvis
