Requires !3106 (merged), !3116 (merged), !3111 (merged), !3117 (merged), !3119 (merged), !3123 (merged)

Introduction

The first step toward adding new aggregations (per-cluster, per-feature-category, and so on) is to simplify the way we handle our existing aggregations.

This MR is that first step: it simplifies the recording rules for aggregations. The next step will be to simplify the Grafonnet libraries used to display them.

Description

GitLab's SLO monitoring framework relies on SLI metrics collected from multiple Prometheus instances and presented as a single view in Thanos.

This MR is a major refactor/rewrite of the way that we aggregate data in our monitoring system.

In order to reduce risk, this refactor was carried out in two steps:

1. Adjust the "old way" of generating recording rules so that its output matches that of the new way. This was done through many small merge requests: see https://gitlab.com/gitlab-com/runbooks/-/merge_requests?scope=all&utf8=%E2%9C%93&state=all&label_name[]=AggregationSets
2. Carry out the big refactor (this merge request) while ensuring that the output (i.e. the configured YAML) remains the same as before.

Because the YAML configuration is almost identical before and after this change, we can be confident in the refactor: it has been reduced from a major configuration change to a non-functional one with only minimal configuration changes.
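One way to check the "output stays the same" claim mechanically is to compare the recording rules before and after while ignoring their position in the file. The following is a minimal Python sketch, not part of this MR: the function names are hypothetical, the rule files are assumed to be already parsed into dictionaries in the standard Prometheus rule-file shape (`groups` containing `rules` with `record`, `expr` and optional `labels`), and the example rule names are made up. Note that it also ignores which group a rule belongs to, which is slightly looser than "moving location within a file".

```python
from collections import Counter


def rule_key(rule):
    """Identity of a recording rule: name, expression and static labels."""
    labels = tuple(sorted(rule.get("labels", {}).items()))
    return (rule["record"], rule["expr"].strip(), labels)


def collect_rules(rule_file):
    """Count every rule in a parsed rule file, ignoring group and position."""
    return Counter(
        rule_key(rule)
        for group in rule_file.get("groups", [])
        for rule in group.get("rules", [])
    )


def diff_rules(before, after):
    """Return (removed, added) rule keys between two parsed rule files."""
    old, new = collect_rules(before), collect_rules(after)
    return sorted((old - new).elements()), sorted((new - old).elements())


# Hypothetical example data: two rules, then the same rules reordered.
rule_a = {"record": "example_ops:rate_5m", "expr": "sum by (type) (a)"}
rule_b = {"record": "example_ops:rate_1h", "expr": "sum by (type) (b)"}

before = {"groups": [{"name": "g", "rules": [rule_a, rule_b]}]}
reordered = {"groups": [{"name": "g", "rules": [rule_b, rule_a]}]}
changed = {
    "groups": [
        {"name": "g", "rules": [rule_a, dict(rule_b, expr="sum by (env) (b)")]}
    ]
}

print(diff_rules(before, reordered))  # no differences expected
print(diff_rules(before, changed))    # one rule removed, one added
```

A reordering produces two empty lists, while a changed expression shows up as one removed and one added rule, which is the kind of difference a reviewer should not expect to see in this MR.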

Reviewers Guide

Review the YAML changes, i.e. the "output" of the program. Apart from a few recording rules moving location within a file, no rules should be added or removed, and no expressions should change. Those changes were already made in the previous MRs.
1. Understanding that the output is functionally identical should help reduce the anxiety of reviewing such a big change 😄
Understand the spirit of the change. This change abstracts out the concept of an "aggregation set".
1. An aggregation set is a matrix of the key metric measurements for an SLI (apdex score, apdex weight, error rate, operation rate and error ratio) across a series of burn rate intervals (normally 5m, 30m, 1h and 6h; sometimes also 1m for legacy reasons).
2. Aggregation sets have a common set of labels over which they are aggregated, and a unique name.
3. Aggregation sets also have selector labels, which are applied to queries using the aggregation set to ensure consistent routing to the appropriate Thanos/Prometheus instance.
4. The aggregation sets are defined in metrics-catalog/aggregation-sets.libsonnet. Any future aggregation sets will be defined here too.
5. Most importantly, a single set of transformation functions is used to transform one aggregation set into another.
6. This ensures consistency and allows us to add new aggregations without duplicating thousands of recording rule names in dozens of places throughout the codebase, as we did before this change.
Understand future changes coming after this.
1. We will add two new aggregation sets: Regional aggregations, for zone/region monitoring of the application, and Feature Category aggregations, as the basis for error budgeting.
2. Additionally, we will retrofit the SLI dashboard components to use aggregation sets. This will reduce the number of components substantially, since we won't need separate apdex, error ratio and operation rate dashboard panels for each aggregation set. A single set of components will be built and maintained that can be used to display key metric indicators for any aggregation set.
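
To make the aggregation set idea from the guide above more concrete, here is a rough Python sketch. It is an illustration only: the real definitions are written in Jsonnet in metrics-catalog/aggregation-sets.libsonnet, and the class, field names, metric names and selector used here are all hypothetical. It models a set as a name, aggregation labels, selector labels and a name template per key metric, and uses one transformation function to generate the rules that roll one set up into another. Only rate-style metrics are covered, since a plain sum is assumed to be meaningful for rates; ratios and apdex scores would need a different expression, which the sketch leaves out.

```python
from dataclasses import dataclass


@dataclass
class AggregationSet:
    """Hypothetical model of an aggregation set (not the real Jsonnet shape)."""
    name: str              # unique name of the set
    labels: tuple          # labels the set is aggregated over
    selector: dict         # selector labels applied to queries on this set
    metric_formats: dict   # key metric -> recording rule name template
    burn_rates: tuple = ("5m", "30m", "1h", "6h")

    def metric_name(self, metric, burn_rate):
        """Recording rule name for one cell of the metric/burn-rate matrix."""
        return self.metric_formats[metric].format(burn_rate=burn_rate)

    def selector_expr(self):
        """Render the selector labels as a PromQL label matcher list."""
        return ",".join(f'{k}="{v}"' for k, v in sorted(self.selector.items()))


def transform_rules(source, target):
    """Generate rules that sum the rate metrics of `source` into `target`."""
    by = ",".join(target.labels)
    rules = []
    for metric in target.metric_formats:
        if metric not in source.metric_formats:
            continue  # the source set does not provide this key metric
        for burn_rate in target.burn_rates:
            src = source.metric_name(metric, burn_rate)
            expr = f"sum by ({by}) ({src}{{{source.selector_expr()}}})"
            rules.append(
                {"record": target.metric_name(metric, burn_rate), "expr": expr}
            )
    return rules


# Hypothetical sets: a per-component set rolled up into a per-service set.
component = AggregationSet(
    name="component",
    labels=("env", "type", "component"),
    selector={"monitor": "global"},
    metric_formats={
        "ops_rate": "example_component_ops:rate_{burn_rate}",
        "error_rate": "example_component_errors:rate_{burn_rate}",
    },
)
service = AggregationSet(
    name="service",
    labels=("env", "type"),
    selector={"monitor": "global"},
    metric_formats={
        "ops_rate": "example_service_ops:rate_{burn_rate}",
        "error_rate": "example_service_errors:rate_{burn_rate}",
    },
)

rules = transform_rules(component, service)
for rule in rules:
    print(rule["record"], "=", rule["expr"])
```

Under these assumptions, adding a new aggregation means defining one more set and calling the same function, rather than writing out each recording rule name by hand.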

Edited Jan 22, 2021 by Andrew Newdigate
