# Record aggregation sets in Mimir from sli_aggregations
#2420 (closed) describes how our source metrics now go through several layers of aggregation in both Prometheus & Thanos-ruler before being used.
This leads to an aggregation chain that, as we've discussed in #2445 (closed), causes misalignment of metrics and inaccuracies in ratio calculations. It also makes it harder to scale our Prometheus infrastructure, which is responsible for both scraping metrics and evaluating these sometimes high-cardinality recording rules.
To work around this, we should try evaluating these rules in Thanos directly, without any intermediate aggregation sets or the recording rule registry.
This means we'll be recording the following aggregations from source metrics, separated by service:
Note: The recordings for these should also include the services (`thanos` & `code_suggestions`) that are currently already globally evaluated.
## For Service Level Availability & Alerting

- `componentSLIs`: (`gitlab_component*`)
- `regionalComponentSLIs`: (`gitlab_regional*`)
- `nodeComponentSLIs`: (`gitlab_component_node*`)
- `shardComponentSLIs`: (`gitlab_component_shard*`)
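
As a rough sketch of what one of these first-level recordings could look like when sourced directly from the `sli_aggregations:` rules; the SLI name, label list and output metric below are illustrative placeholders, not the actual registry output:

```yaml
# Illustrative sketch: aggregate a per-SLI source recording straight into
# the componentSLIs aggregation set, without an intermediate Prometheus layer.
# "some_sli" and the label list are placeholders, not real registry output.
- record: gitlab_component_ops:rate_5m
  expr: >
    sum by (env, environment, tier, type, stage, component) (
      sli_aggregations:gitlab_sli_some_sli_total:rate_5m
    )
```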
### 2nd level aggregations

- `componentSLIs` -- aggregated into --> `serviceSLIs` (`gitlab_service*`): this aggregation will need to be used for the Service Level Availability calculation (sketched below). As far as I know, we only need the 5m burn rate for this for now; this needs to be confirmed.
- `regionalComponentSLIs` -- aggregated into --> `regionalServiceSLIs`: this aggregation is used on the regional dashboards. We won't need the aggregation filter for this, as we can use the `regionalComponentSLIs` for that.
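
A minimal sketch of that `componentSLIs` --> `serviceSLIs` step at the 5m burn rate, assuming the illustrative first-level names from the sketch above:

```yaml
# Illustrative: componentSLIs -> serviceSLIs at the 5m burn rate; dropping
# the component label from the aggregation yields a per-service series.
- record: gitlab_service_ops:rate_5m
  expr: >
    sum by (env, environment, tier, type, stage) (
      gitlab_component_ops:rate_5m
    )
```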
## For error budgets for stage groups

- `featureCategorySLIs`: (`gitlab:component:feature_category:*`)

### 2nd level aggregations

- `featureCategorySLIs` --> `serviceComponentStageGroupSLIs`
- `featureCategorySLIs` --> `stageGroupSLIs`
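
Since the feature category to stage group mapping lives in our catalog rather than in the metrics themselves, these second-level rules would presumably be generated per stage group, along these lines (the rule name, labels and feature categories are all illustrative):

```yaml
# Illustrative: one generated rule per stage group, selecting that group's
# feature categories; the mapping comes from the stage-group catalog, not
# from PromQL itself.
- record: gitlab:component:stage_group:ops:rate_1h
  labels:
    stage_group: some_stage_group
  expr: >
    sum by (env, environment, stage) (
      gitlab:component:feature_category:ops:rate_1h{
        feature_category=~"category_a|category_b"
      }
    )
```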
Extra recording rules for Sidekiq:

- `sidekiqWorkerQueueSLIs`: `gitlab_background_jobs:queue:*`
## Implementation details
These recordings currently start in Prometheus in `rules-jsonnet/service-key-metrics.jsonnet`, where we record them for the services that are not globally evaluated. The equivalent for globally evaluated services is in `thanos-rules-jsonnet/service-key-metrics.jsonnet`.

These two files will need to be unified, generating recording rule files for all services in Thanos. For this, we could build new aggregation sets for the `component`, `node`, `featureCategory` and `shard` aggregations. Initially these will need different metric names so we don't conflict with the existing recording rules. Later, we'll rename them to match the current aggregations, so people, dashboards and alerts can continue working with the aggregations they have always worked with. There's an example in `thanos-staging-rules-jsonnet/experimental-transactional-aggregations.jsonnet`.
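
Purely as an illustration of that transitional naming, a new rule could initially record under a non-conflicting name and only take over the existing one once the old Prometheus rules are removed (the `_mimir` suffix here is made up):

```yaml
# Illustrative transitional naming: record under a name that cannot clash
# with the rules Prometheus still evaluates during the migration.
- record: gitlab_component_ops:rate_5m_mimir  # made-up temporary name
  expr: >
    sum by (env, environment, tier, type, stage, component) (
      sli_aggregations:gitlab_sli_some_sli_total:rate_5m
    )
# After cutover, this would be renamed to gitlab_component_ops:rate_5m.
```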
These aggregations are going to use the recording rule registry we're building in #2602 (closed). This means that we will always use `sli_aggregations:` recording rules directly for all aggregation sets.
For these new aggregations, we'll have to add new recording rules to record error and apdex ratios. For this, we can use `recording-rules/aggregation-set-apdex-ratio-reflected-rule-set.libsonnet` and `recording-rules/aggregation-set-error-ratio-reflected-rule-set.libsonnet` in a first iteration. In the future, we'll start using transactional ratios (#2483).
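
Reflecting the rate aggregations into a ratio is then a straightforward division per burn rate; schematically, with the same illustrative names as above:

```yaml
# Illustrative: error ratio reflected from the error- and ops-rate
# aggregations at the 5m burn rate.
- record: gitlab_component_errors:ratio_5m
  expr: >
    gitlab_component_errors:rate_5m
    /
    gitlab_component_ops:rate_5m
```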
These aggregations are then fed into different views: the second-level aggregations. This is currently done in `thanos-rules-jsonnet/aggregation-set-recording-rules.jsonnet`. We'll need to update it to include only what we still need; some of these rules won't be necessary, as they were used to aggregate the Prometheus metrics into global views, which we're now doing directly in the new aggregation sets.
## Goal

At the end of this issue, all recordings for SLI aggregations, error budgets for stage groups, and availability happen in Mimir. This means we can remove the `rules-jsonnet/service-key-metrics.jsonnet` file entirely.