Reduce the delay in metrics when viewing global aggregation sets
For some aggregations, we go through several layers of recording rules, each evaluated at its own interval and taking some time to complete. For example, the `gitlab_service_apdex:ratio_5m` series goes through these layers for the `rails_request` component:
- Prometheus: `sli_aggregations:gitlab_sli_rails_request_total_rate5m`
- Prometheus: `gitlab_component_ops:rate_5m{component="rails_request"}`
- Thanos: `gitlab_component_ops:rate_5m`
Each of these recording rules can add up to ~1m of delay, depending on the interval at which it is evaluated.
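For illustration, here is a hypothetical sketch of what the first two layers might look like as Prometheus rule groups. The rule names match the series above, but the intervals, label sets, and expressions are simplified assumptions, not the actual definitions in our metrics catalog:

```yaml
groups:
  # Layer 1 (Prometheus): rate over the raw source counter.
  - name: sli_aggregations
    interval: 1m  # each evaluation interval adds up to ~1m of delay
    rules:
      - record: sli_aggregations:gitlab_sli_rails_request_total_rate5m
        expr: >
          sum by (env, environment, stage) (
            rate(gitlab_sli_rails_request_total[5m])
          )

  # Layer 2 (Prometheus): per-component aggregation that reads the
  # output of layer 1, adding another evaluation interval of delay.
  - name: component_aggregations
    interval: 1m
    rules:
      - record: gitlab_component_ops:rate_5m
        labels:
          component: rails_request
        expr: >
          sum by (env, environment, stage) (
            sli_aggregations:gitlab_sli_rails_request_total_rate5m
          )

  # Layer 3 lives in Thanos Ruler: a similar rule re-records
  # gitlab_component_ops:rate_5m across all Prometheus shards,
  # adding a third evaluation interval of delay.
```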
In #2174 (closed), we looked at the impact of removing the first step (`sli_aggregations:`) for another recording rule chain, but noticed that it had a negative impact on all recording rule groups evaluated in that Prometheus instance, so we decided against that.
During a call about this on 2023-04-19 (notes), @nduff mentioned we could try expanding these recording rules in Thanos Ruler: instead of reading each other's output, each recording rule would query the source metrics directly. Theoretically, this would rely mostly on cached data, only needing to fetch the last minute from Prometheus, and it would remove all of the delay introduced by the several layers of recording rules. On top of that, thanos-ruler should be easier to scale than Prometheus.
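As a sketch of what "expanding" would mean here, reusing the same hypothetical names and labels as above, the Thanos Ruler rule would evaluate the full expression against the source metric in a single step:

```yaml
groups:
  # Hypothetical expanded rule, evaluated directly in Thanos Ruler.
  # It reaches back to the raw counter instead of the intermediate
  # recordings, so only one evaluation interval of delay remains.
  - name: expanded_component_aggregations
    interval: 1m
    rules:
      - record: gitlab_component_ops:rate_5m
        labels:
          component: rails_request
        expr: >
          sum by (env, environment, stage) (
            rate(gitlab_sli_rails_request_total[5m])
          )
```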
Proposal
We could try this out with an equivalent to the global `gitlab_component_*` aggregation set and compare the results of those recordings with the ones using a recording rule chain.
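One way to compare the two, assuming the trial rules record under a distinct name such as `gitlab_component_ops:rate_5m_trial` (a hypothetical name for this sketch), would be to chart the difference between the two series in Thanos:

```promql
# The delta between the chained recording and the trial Thanos Ruler
# recording should hover around zero, apart from the time shift
# introduced by the extra recording rule layers.
  gitlab_component_ops:rate_5m{component="rails_request"}
- gitlab_component_ops:rate_5m_trial{component="rails_request"}
```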
If the recordings solve the delay and don't cause performance issues for thanos-ruler, thanos-store, or thanos-query, we could start a project to migrate all of our Thanos SLI rules to use this approach.