# Record aggregation sets in Mimir from sli_aggregations
#2420 (closed) describes how our source metrics now go through several layers of aggregation in both Prometheus & Thanos-ruler before being used.
This leads to an aggregation chain that, as we've discussed in #2445 (closed), causes misalignment of metrics and inaccuracies in ratio calculations. It also makes it harder to scale our Prometheus infrastructure, which is responsible for both scraping metrics and evaluating these sometimes high-cardinality recording rules.
To work around this, we should try evaluating these rules in Thanos directly, without any intermediate aggregation sets or the recording rule registry.
This means we'll be recording the following aggregations from source metrics, separated by service:
Note: The recordings for these should also include the services (`thanos` & `code_suggestions`) that are currently already globally evaluated.
## For Service Level Availability & Alerting

- `componentSLIs`: (`gitlab_component*`)
- `regionalComponentSLIs`: (`gitlab_regional*`)
- `nodeComponentSLIs`: (`gitlab_component_node*`)
- `shardComponentSLIs`: (`gitlab_component_shard*`)
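
As a rough sketch of what one of these first-level recordings could look like when sourced directly from the `sli_aggregations:` rules; the SLI name, label list and output metric below are illustrative placeholders, not the actual registry output:

```yaml
# Illustrative sketch: aggregate a per-SLI source recording straight into
# the componentSLIs aggregation set, without an intermediate Prometheus layer.
# "some_sli" and the label list are placeholders, not real registry output.
- record: gitlab_component_ops:rate_5m
  expr: >
    sum by (env, environment, tier, type, stage, component) (
      sli_aggregations:gitlab_sli_some_sli_total:rate_5m
    )
```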
### 2nd level aggregations

- `componentSLIs` -- aggregated into --> `serviceSLIs` (`gitlab_service*`): this aggregation will need to be used for the Service Level Availability calculation (sketched below). As far as I know, we only need the 5m burn rate for this for now; this needs to be confirmed.
- `regionalComponentSLIs` -- aggregated into --> `regionalServiceSLIs`: this aggregation is used on the regional dashboards. We won't need the aggregation filter for this, as we can use the `regionalComponentSLIs` for that.
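
A minimal sketch of that `componentSLIs` --> `serviceSLIs` step at the 5m burn rate, assuming the illustrative first-level names from the sketch above:

```yaml
# Illustrative: componentSLIs -> serviceSLIs at the 5m burn rate; dropping
# the component label from the aggregation yields a per-service series.
- record: gitlab_service_ops:rate_5m
  expr: >
    sum by (env, environment, tier, type, stage) (
      gitlab_component_ops:rate_5m
    )
```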
## For error budgets for stage groups

- `featureCategorySLIs`: (`gitlab:component:feature_category:*`)

### 2nd level aggregations

- `featureCategorySLIs` --> `serviceComponentStageGroupSLIs`
- `featureCategorySLIs` --> `stageGroupSLIs`
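
Since the feature category to stage group mapping lives in our catalog rather than in the metrics themselves, these second-level rules would presumably be generated per stage group, along these lines (the rule name, labels and feature categories are all illustrative):

```yaml
# Illustrative: one generated rule per stage group, selecting that group's
# feature categories; the mapping comes from the stage-group catalog, not
# from PromQL itself.
- record: gitlab:component:stage_group:ops:rate_1h
  labels:
    stage_group: some_stage_group
  expr: >
    sum by (env, environment, stage) (
      gitlab:component:feature_category:ops:rate_1h{
        feature_category=~"category_a|category_b"
      }
    )
```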
Extra recording rules for Sidekiq:

- `sidekiqWorkerQueueSLIs`: `gitlab_background_jobs:queue:*`
## Implementation details
These recordings currently start in Prometheus in `rules-jsonnet/service-key-metrics.jsonnet`, where we record them for the services that are not globally evaluated. The equivalent for globally evaluated services is in `thanos-rules-jsonnet/service-key-metrics.jsonnet`.

These two files will need to be unified, generating recording rule files for all services in Thanos. For this, we could build new aggregation sets for the `component`, `node`, `featureCategory` and `shard` aggregations. Initially these will need different metric names so we don't conflict with the existing recording rules. Later, we'll rename them to match the current aggregations, so people, dashboards and alerts can continue working with the aggregations they have always worked with. There's an example in `thanos-staging-rules-jsonnet/experimental-transactional-aggregations.jsonnet`.
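
Purely as an illustration of that transitional naming, a new rule could initially record under a non-conflicting name and only take over the existing one once the old Prometheus rules are removed (the `_mimir` suffix here is made up):

```yaml
# Illustrative transitional naming: record under a name that cannot clash
# with the rules Prometheus still evaluates during the migration.
- record: gitlab_component_ops:rate_5m_mimir  # made-up temporary name
  expr: >
    sum by (env, environment, tier, type, stage, component) (
      sli_aggregations:gitlab_sli_some_sli_total:rate_5m
    )
# After cutover, this would be renamed to gitlab_component_ops:rate_5m.
```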
These aggregations are going to use the recording rule registry we're building in #2602 (closed). This means that we will always use `sli_aggregations:` recording rules directly for all aggregation sets.
For these new aggregations, we'll have to add new recording rules to record error and apdex ratios. For this, we can use `recording-rules/aggregation-set-apdex-ratio-reflected-rule-set.libsonnet` and `recording-rules/aggregation-set-error-ratio-reflected-rule-set.libsonnet` in a first iteration. In the future, we'll start using transactional ratios (#2483).
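
Reflecting the rate aggregations into a ratio is then a straightforward division per burn rate; schematically, with the same illustrative names as above:

```yaml
# Illustrative: error ratio reflected from the error- and ops-rate
# aggregations at the 5m burn rate.
- record: gitlab_component_errors:ratio_5m
  expr: >
    gitlab_component_errors:rate_5m
    /
    gitlab_component_ops:rate_5m
```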
These aggregations are then fed into different views: the second-level aggregations. This is currently done in `thanos-rules-jsonnet/aggregation-set-recording-rules.jsonnet`. We'll need to update it to include only what we still need; some of these rules won't be necessary, as they were used to aggregate the Prometheus metrics into global views, which we're now doing directly in the new aggregation sets.
## Goal

At the end of this issue, all recordings for SLI aggregations, error budgets for stage groups, and availability happen in Mimir. This means we can remove the `rules-jsonnet/service-key-metrics.jsonnet` file entirely.