Cross-tenant error budgets in Mimir
In our Thanos environment, all metrics from all services counted towards error budgets for stage groups, as long as they had a feature category on the SLI.
Currently, all of our recordings for error budgets happen in the giant tenant for gprd. This will be sufficient in the first iteration, as that is where most of the consumers of error budgets for stage groups have their metrics.
However, in #3486 (closed) we're consolidating all metrics from services deployed using Runway into a separate tenant. This means that groups who currently have SLIs from those services counting toward their error budget will lose this input, so we'll need to bring it back.
Proposal
Keep all the feature category aggregations tenant-local, in each of the tenants that emit the source metrics. The feature category aggregation is used for aggregating all of the source metrics (through sli_aggregations:).
Then we evaluate all of the recording rules for the stage-group aggregation, which is what users view on dashboards, in a new tenant that can federate queries across the other tenants. That way, all of the stage group dashboards query this single tenant, delegating to the underlying tenants only when more detail is needed.
This is currently an experimental Mimir feature: https://grafana.com/docs/enterprise-metrics/latest/tenant-management/tenant-federation/#cross-tenant-alerting-and-recording-rules
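To make the proposal more concrete, here is a minimal Python sketch of what a federated rule group and a federated query header could look like. It assumes Mimir's tenant federation conventions as described in the linked docs: tenant IDs joined with `|` in the `X-Scope-OrgID` header, and a `source_tenants` list on the rule group. The tenant IDs, the group name, the record name, the source metric name after the `sli_aggregations:` prefix, and the presence of a `stage_group` label are all hypothetical placeholders, not the names we would actually use.

```python
import json

# Hypothetical tenant IDs; the real tenant names are not given in this issue.
SOURCE_TENANTS = ["gprd-tenant", "runway-tenant"]


def federated_org_id(tenants):
    """Build an X-Scope-OrgID value that addresses several tenants at once.

    Assumes Mimir's tenant federation convention of joining IDs with '|'.
    Duplicates are dropped and the IDs are sorted for a stable result.
    """
    return "|".join(sorted(set(tenants)))


def stage_group_rule_group(tenants):
    """Return a rule group (as a dict) that reads from several source tenants."""
    return {
        "name": "stage-group-error-budget",  # hypothetical group name
        "interval": "1m",
        # source_tenants is what marks the group as federated for the ruler.
        "source_tenants": sorted(set(tenants)),
        "rules": [
            {
                # Hypothetical record name and source metric; this also
                # assumes the tenant-local aggregation carries a stage_group label.
                "record": "stage_group:error_budget:ops_rate_1h",
                "expr": (
                    "sum by (stage_group) "
                    "(sli_aggregations:feature_category:ops_rate_1h)"
                ),
            }
        ],
    }


if __name__ == "__main__":
    # Header a dashboard datasource would send to query across both tenants.
    print({"X-Scope-OrgID": federated_org_id(SOURCE_TENANTS)})
    # JSON is valid YAML, so this output can be pasted into a rules file.
    print(json.dumps({"groups": [stage_group_rule_group(SOURCE_TENANTS)]}, indent=2))
```

The rule group would be loaded into the new federating tenant, while the `sli_aggregations:` recordings stay in each source tenant. As far as the Mimir docs describe it, this also needs tenant federation enabled for both the query path and the ruler.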