Reduce the delay in metrics when viewing global aggregation sets
For some aggregations, we go through several layers of recording rules, each evaluated at its own interval and taking some time to complete. For example, the `gitlab_service_apdex:ratio_5m` series goes through these layers for the `rails_request` component:
- Prometheus: `sli_aggregations:gitlab_sli_rails_request_total_rate5m`
- Prometheus: `gitlab_component_ops:rate_5m{component="rails_request"}`
- Thanos: `gitlab_component_ops:rate_5m`
Each of these recording rules can add up to ~1m of delay, depending on the interval at which it is evaluated.
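For illustration, here is a hypothetical sketch of what the first two layers might look like as Prometheus rule groups. The rule names match the series above, but the intervals, label sets, and expressions are simplified assumptions, not the actual definitions in our metrics catalog:

```yaml
groups:
  # Layer 1 (Prometheus): rate over the raw source counter.
  - name: sli_aggregations
    interval: 1m  # each evaluation interval adds up to ~1m of delay
    rules:
      - record: sli_aggregations:gitlab_sli_rails_request_total_rate5m
        expr: >
          sum by (env, environment, stage) (
            rate(gitlab_sli_rails_request_total[5m])
          )

  # Layer 2 (Prometheus): per-component aggregation that reads the
  # output of layer 1, adding another evaluation interval of delay.
  - name: component_aggregations
    interval: 1m
    rules:
      - record: gitlab_component_ops:rate_5m
        labels:
          component: rails_request
        expr: >
          sum by (env, environment, stage) (
            sli_aggregations:gitlab_sli_rails_request_total_rate5m
          )

  # Layer 3 lives in Thanos Ruler: a similar rule re-records
  # gitlab_component_ops:rate_5m across all Prometheus shards,
  # adding a third evaluation interval of delay.
```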
In #2174 (closed), we looked at the impact of removing the first step (`sli_aggregations:`) for another recording rule chain, but noticed that it had a negative impact on all recording rule groups evaluated in that Prometheus instance, so we decided against that.
During a call about this on 2023-04-19 (notes), @nduff mentioned we could try expanding these recording rules in Thanos Ruler: instead of reading each other's output, each recording rule would query the source metrics directly. Theoretically, this would rely mostly on cached data, only needing to fetch the last minute from Prometheus, and it would remove all of the delay introduced by the several layers of recording rules. On top of that, thanos-ruler should be easier to scale than Prometheus.
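As a sketch of what "expanding" would mean here, reusing the same hypothetical names and labels as above, the Thanos Ruler rule would evaluate the full expression against the source metric in a single step:

```yaml
groups:
  # Hypothetical expanded rule, evaluated directly in Thanos Ruler.
  # It reaches back to the raw counter instead of the intermediate
  # recordings, so only one evaluation interval of delay remains.
  - name: expanded_component_aggregations
    interval: 1m
    rules:
      - record: gitlab_component_ops:rate_5m
        labels:
          component: rails_request
        expr: >
          sum by (env, environment, stage) (
            rate(gitlab_sli_rails_request_total[5m])
          )
```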
Proposal
We could try this out with an equivalent to the global `gitlab_component_*` aggregation set and compare the results of those recordings with the ones using a recording rule chain.
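One way to compare the two, assuming the trial rules record under a distinct name such as `gitlab_component_ops:rate_5m_trial` (a hypothetical name for this sketch), would be to chart the difference between the two series in Thanos:

```promql
# The delta between the chained recording and the trial Thanos Ruler
# recording should hover around zero, apart from the time shift
# introduced by the extra recording rule layers.
  gitlab_component_ops:rate_5m{component="rails_request"}
- gitlab_component_ops:rate_5m_trial{component="rails_request"}
```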
If the recordings solve the delay and don't cause performance issues for thanos-ruler, thanos-store, or thanos-query, we could start a project to migrate all of our Thanos SLI rules to use this approach.