Aggregation sets use a unified, scoped recording rule registry

SLI aggregations are recording rules we added to Prometheus for high-cardinality metrics.

These are generated through the recording-rule-registry, and their metric names start with sli_aggregations:. Whenever a service adds a metric to recordingRuleMetrics in its definition, the autogenerated-key-metrics file in Prometheus will contain these recording rules. However, we deploy these rules to all Prometheus deployments, and the recording rules are not scoped by service. This means that the recording rule for http_requests_total added by one service is used across all services.

We currently don't have these metrics in Thanos environments at all. We cannot record all of the high-cardinality aggregations directly from source metrics without scoping them. If we don't scope them, too many metrics need to be loaded at once by thanos-query, causing it to run out of memory. Furthermore, we need a way to shard these recordings even further as the size of our fleet grows.

Proposal: Unified and scoped recording rule registry

In #2599 (comment 1647759195) we discuss the usage of a new recording rule registry.

It is unified: The new recording rule registry contains recording rules for all metrics used in SLIs. These metrics are always used for other aggregations and in dashboards. Metrics don't need to be specified in the recordingRuleMetrics field of a service.

It is scoped: The sli_aggregations: recording rules should be recorded by environment (gprd, gstg, ...), type (web, api, ...), and cluster (gprd-us-east1-b, gprd-us-east1-c, ...). This should be done by splitting these rules into separate files (for example, gprd/web/gprd-us-east1-b/sli_aggregations.yml). This will allow us to add extra scopes when our fleet grows even more in the future.
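As a rough illustration of the scoping, here is a minimal Python sketch that expands a set of scopes into one rule file path per environment, type and cluster, following the example path above. The function name, the shape of the `scopes` mapping, and the `env`/`type`/`cluster` label names in the selector are assumptions made for this sketch; they are not taken from the runbooks code.

```python
from pathlib import PurePosixPath


def scoped_rule_files(scopes, filename="sli_aggregations.yml"):
    """Return one rule file path per (environment, type, cluster) scope.

    `scopes` is a hypothetical mapping: environment -> type -> list of clusters.
    """
    paths = []
    for env, types in scopes.items():
        for service_type, clusters in types.items():
            for cluster in clusters:
                paths.append(str(PurePosixPath(env, service_type, cluster, filename)))
    return paths


def scope_selector(env, service_type, cluster):
    """Build the label selector that restricts a rule to a single scope.

    The label names (env, type, cluster) are assumed for illustration.
    """
    return f'env="{env}",type="{service_type}",cluster="{cluster}"'


if __name__ == "__main__":
    example = {"gprd": {"web": ["gprd-us-east1-b", "gprd-us-east1-c"]}}
    for path in scoped_rule_files(example):
        print(path)
```

Adding a further scope later would mean adding one more nesting level to the mapping and one more path segment, which is the property the proposal relies on.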

The goal is to be able to use this sli_aggregations: recording rule everywhere for aggregation sets and on dashboards. This means that it needs to have the appropriate aggregation labels. These would be:

The significantLabels defined on an SLI. All labels used in dashboards or elsewhere should be added to the significant labels. We should assert this when trying to use one of the SLI aggregations.
The labels defined on all aggregation-sets. This could result in too many labels, but that is not a problem, as empty labels won't add anything to the recording rule.
All the selector labels used in the SLI definition (used in the selector= argument of the SLI functions rate, histogramApdex...).
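The three label sources listed above combine into a single set of aggregation labels. The following Python sketch shows that union and the proposed assertion; the function names and input shapes are hypothetical and only serve to illustrate the idea.

```python
def aggregation_labels(significant_labels, aggregation_set_labels, selector):
    """Union of the three label sources for one SLI.

    significant_labels: labels declared as significantLabels on the SLI
    aggregation_set_labels: one list of labels per aggregation set
    selector: mapping of selector label -> value used in the SLI definition
    """
    labels = set(significant_labels)
    for set_labels in aggregation_set_labels:
        labels.update(set_labels)
    labels.update(selector.keys())
    return sorted(labels)


def assert_labels_recorded(requested, recorded):
    """Fail loudly when a consumer asks for a label that is not recorded."""
    missing = sorted(set(requested) - set(recorded))
    if missing:
        raise ValueError(f"labels not recorded by the SLI aggregation: {missing}")


if __name__ == "__main__":
    recorded = aggregation_labels(
        ["feature_category"],
        [["env", "type"], ["env", "stage"]],
        {"job": "example"},
    )
    print(recorded)
    assert_labels_recorded(["env", "feature_category"], recorded)
```

Labels that are not present on a given source metric simply stay empty in the recorded series, which is why the union can be generous.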

These files should have a recording rule group per burn-rate that we record for.
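One way to picture a group per burn rate is the sketch below, which builds the Prometheus rule-group structure (a `groups` list of `name` plus `rules`, each rule with `record` and `expr`) as plain Python data. The burn-rate windows, the group names, and the `sli_aggregations:<metric>:rate_<window>` record naming are assumptions for illustration, not the names the registry actually generates.

```python
def rule_groups(metric, labels, scope, burn_rates=("5m", "30m", "1h", "6h")):
    """Build one recording rule group per burn rate for a single metric.

    metric: source metric name
    labels: aggregation labels to keep in the recording
    scope: mapping of scope label -> value, used as the selector
    burn_rates: example windows, assumed for this sketch
    """
    selector = ",".join(f'{key}="{value}"' for key, value in sorted(scope.items()))
    by = ", ".join(labels)
    groups = []
    for burn_rate in burn_rates:
        groups.append(
            {
                "name": f"SLI aggregations: {burn_rate} burn rate",
                "rules": [
                    {
                        # Hypothetical record name, following the sli_aggregations: prefix.
                        "record": f"sli_aggregations:{metric}:rate_{burn_rate}",
                        "expr": f"sum by ({by}) (rate({metric}{{{selector}}}[{burn_rate}]))",
                    }
                ],
            }
        )
    return groups


if __name__ == "__main__":
    for group in rule_groups("http_requests_total", ["env", "type"], {"env": "gprd", "type": "web"}):
        print(group["name"], "->", group["rules"][0]["expr"])
```

Keeping each burn rate in its own group lets Prometheus evaluate the groups independently of one another.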

We should use this new recording-rule registry in aggregation sets that are recorded from source metrics in Thanos.

Progress

Enforce selectors in SLIs defined as objects gitlab-com/runbooks!6700 (merged)
API to gather all metrics, labels and selectors needed for a given SLI gitlab-com/runbooks!6612 (merged)
Implement method to regexp-escape strings gitlab-com/runbooks!6816 (merged)
Generate the new sli_aggregations: recording rules across all SLIs gitlab-com/runbooks!6711 (merged)
Implement the new recording rule registry and record one aggregation set gitlab-com/runbooks!6874 (merged)

Edited Feb 28, 2024 by Gregorius Marco