Separate gprd recordings from non-prod recording rule groups in Thanos
Thanos provides a single pane of glass for all of our metrics at GitLab. This means that metrics from several environments are collected there.
Some recording rules also use the `env` and `environment` labels in their aggregation, recording metrics across several environments in a single rule group. In other words, they record from several Prometheus instances (`monitor: 'default'`) to have a global view inside a single recording rule. This means we cannot use `partial_response_strategy: abort` on those rules: if a Prometheus instance in `gstg` did not respond, we would also skip recording metrics from `gprd`.
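As a rough illustration of the problem, a rule group of the following shape queries every environment at once. This is a sketch only: the group name, record name, metric name, and grouping labels are illustrative assumptions and are not copied from the generated rule files.

```yaml
# Illustrative sketch: names and expression are hypothetical.
groups:
  - name: "Service aggregation (all environments)"
    interval: 1m
    # With abort, the rule query fails when any store it fans out to
    # does not respond, so nothing is recorded for that evaluation.
    partial_response_strategy: "abort"
    rules:
      - record: gitlab_service_ops:rate_1h
        # No env matcher: the query covers gprd, gstg, and every other environment.
        expr: |
          sum by (env, environment, type, stage) (
            gitlab_component_ops:rate_1h{monitor="default"}
          )
```

Because the expression is not restricted to one environment, an unresponsive `gstg` Prometheus would cause the `gprd` series of this rule to be skipped as well.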
In this issue, we should generate separate recording rule groups for production and non-production metrics. The following rule groups have been identified as aggregating across environments:
- Service aggregation: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-aggregated-service-metrics.yml#L10
- Service-component-regional aggregation: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-aggregated-sli-regional-metrics.yml#L10
- Feature category aggregation: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-aggregated-feature-category-metrics.yml#L11
- Component node aggregation: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-aggregated-sli-node-metrics.yml#L11
- Stage-group-component aggregation: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-aggregated-service-component-stage-group-metrics.yml#L11
- Service-component aggregation: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-aggregated-component-metrics.yml#L11
- GitLab SLI aggregations: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-aggregated-component-metrics.yml#L11
- Stage-group aggregation: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-aggregated-stage-group-metrics.yml#L11
- Sidekiq per-worker alerts: https://gitlab.com/gitlab-com/runbooks/blob/cdf899fa29198472701e11748a174f26d3e03ede/thanos-rules/autogenerated-sidekiq-alerts.yml#L11
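
One possible shape for the split is sketched below for a single aggregation. Group names, the record name, the metric name, and the grouping labels are hypothetical. The sketch also assumes that `env` is an external label on the Prometheus instances, so that an `env` matcher lets Thanos leave stores from other environments out of the query. Whether the non-production group should also use `abort` is left open here.

```yaml
# Illustrative sketch: names and expressions are hypothetical.
groups:
  - name: "Service aggregation (gprd)"
    interval: 1m
    # Only gprd stores are queried, so a gstg outage should not abort this group.
    partial_response_strategy: "abort"
    rules:
      - record: gitlab_service_ops:rate_1h
        expr: |
          sum by (env, environment, type, stage) (
            gitlab_component_ops:rate_1h{monitor="default", env="gprd"}
          )
  - name: "Service aggregation (non-gprd)"
    interval: 1m
    # A failing non-production store only affects this group.
    partial_response_strategy: "abort"
    rules:
      - record: gitlab_service_ops:rate_1h
        expr: |
          sum by (env, environment, type, stage) (
            gitlab_component_ops:rate_1h{monitor="default", env!="gprd"}
          )
```

Both groups record into the same series name. The `env` label kept in the `by` clause keeps the resulting series distinct, so existing consumers of the recorded metric would not need to change under these assumptions.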