Aggregate per-node SLIs in Thanos

Requires !3117 (merged) and !3123 (merged) and !3133 (merged). Extracted from !3068 (merged), in order to reduce the scale of that change.

We perform several different types of aggregation. Of these, the per-node, per-SLI aggregations, which are used only for node-level monitoring of the Gitaly service, are inconsistent with the other aggregations in that:

  1. The aggregation is still performed in Prometheus, instead of Thanos. All other aggregations are now done in Thanos.
    1. This is likely not a problem today, but it would become one if Gitaly metrics were split over multiple Prometheus instances.
  2. Alerting on Gitaly per-node SLO violations still uses generic alerts. For global SLI violations, we moved over to specific alerts per SLI a while back, which allows for better descriptions, routing, and alert names.
    1. This change was not made for the Gitaly per-node alerts.
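
To illustrate the first point, here is a minimal Python sketch of why per-node aggregation inside a single Prometheus instance would break if Gitaly metrics were split across instances. The instance names, node names, and rates are all hypothetical, and the functions only model the idea; they are not the actual recording rules.

```python
from collections import defaultdict

# Hypothetical samples: (prometheus_instance, gitaly_node, errors/sec, requests/sec).
# The metrics for gitaly-01 are assumed to be split over two Prometheus instances.
SAMPLES = [
    ("prometheus-a", "gitaly-01", 2.0, 100.0),
    ("prometheus-b", "gitaly-01", 6.0, 100.0),
    ("prometheus-a", "gitaly-02", 1.0, 200.0),
]


def per_node_error_ratio(samples):
    """Sum errors and requests per node, then divide (a per-node SLI ratio)."""
    errors = defaultdict(float)
    requests = defaultdict(float)
    for _instance, node, error_rate, request_rate in samples:
        errors[node] += error_rate
        requests[node] += request_rate
    return {node: errors[node] / requests[node] for node in requests}


def per_instance_view(samples, instance):
    """Aggregate using only the samples a single Prometheus instance can see."""
    return per_node_error_ratio([s for s in samples if s[0] == instance])


# Global view, analogous to aggregating across all instances in Thanos.
global_view = per_node_error_ratio(SAMPLES)

# Local view, analogous to aggregating inside one Prometheus instance.
local_view = per_instance_view(SAMPLES, "prometheus-a")

print(global_view)
print(local_view)
```

In this sketch the local view under-reports the error ratio for `gitaly-01` because it sees only part of that node's traffic, whereas the global view combines both instances before dividing.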

While neither of these inconsistencies is urgent, both are technical debt. This change addresses the first by moving the node aggregations over to Thanos.

Once this change is complete, all aggregations will be evaluated in a consistent manner, in Thanos Ruler.

This consistency unblocks the aggregation set refactor in !3068 (merged).

Edited by Andrew Newdigate
