Make it possible to create extra aggregations in staging without impacting Production workloads
During the rollout of #2483 in Thanos staging, we introduced unacceptable additional load on the thanos-sidecars running in production. This happened because we duplicated and changed a number of recording rules that aggregate high-cardinality metrics in thanos-staging; to evaluate them against the latest metrics from production, thanos-query has to reach out to all of the sidecars, which caused the overload.
We need a way to roll out experimental rules in staging without affecting production workloads, while also creating enough headroom to roll out the changes safely in production once we get there.
From a discussion between Nick and Bob on Slack: https://gitlab.slack.com/archives/C05LCN16DB4/p1695787407623389
Nick Duff:
I had to disable the experimental rules as they were overloading prometheus unfortunately :sadpanda:
https://gitlab.com/gitlab-com/runbooks/-/merge_requests/6388
Nick Duff:
The thanos sidecar was pretty overwhelmed
bob:
Gag, this means that the things we want to try on the recording-rule side of this are effectively blocked until we do some of the other work we planned to do.
Or until we shard Prometheus more, which could be beneficial even after we start using remote-write for :all_the_things:
bob:
@stejacks @Matt Smiley I think we'll need to regroup on this:
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2483#note_1580047942
Nick Duff:
Sharding more is probably a good step regardless as it needs to happen for our scrape targets alone.
What we would want to do is add a service or component label to our generated Prometheus rules. So we only grab the rules we need for a given Prometheus shard.
Also validating that any dependent rules in Prometheus exist in the same rule group, if possible
bob:
So for the rules we'll end up keeping, we already add a type="" static label: https://gitlab.com/gitlab-com/runbooks/blob/13f2fee473bd45fd22a8288570a49abba2a9f81f/rules/autogenerated-key-metrics-web.yml#L164
But what you're saying is that we need to have this on the sli_aggregations:* recording rules as well?
bob:
Those recording rules would ideally go away when we only evaluate rules in prometheus.
Nick Duff:
Just something we can use in the PrometheusRule filters and serviceMonitors as a label selector to pick up rules.
E.g. if we put sidekiq into its own prom instance, we would want to have some identifying label for sidekiq so we only scope Prometheus to import those serviceMonitors and add those PrometheusRules.
Nick Duff:
So only for “targeting” what each Prometheus shard would do
bob:
Ah, so we don't run all rules everywhere, or try to gather metrics for a service from a prometheus that doesn't monitor that service.
Nick Duff:
Exactly. The rules are less of an issue, since if the metrics don’t exist in a given prom instance/shard they won’t do anything. But it's better not to process them at all if we can avoid it
Nick Duff:
I’ll look at the serviceMonitors tomorrow to see what we have in place but spinning up Prometheus instances is easy enough.
bob:
So shall I add a static type label to all rules in prometheus in the meantime? Would that help?
bob:
Or is that not where the label would help?
Nick Duff:
Yes that could be very helpful. It just needs to exist on the PrometheusRule CRD as a label. Not actually in the rules themselves.
Which, thinking from my phone, is added in CI during deployment :thinking_face:
bob:
It is, but right now we aggregate by type for some rules and expect that label to be present everywhere
Nick Duff:
I wonder if we should be saving the full CRD object like I did with thanos-staging so it’s less hidden in CI?
bob:
For example, http_requests_total becomes sli_aggregations:http_requests_total. It is a rule in the web file, but it's used all over the place: https://gitlab.com/gitlab-com/runbooks/blob/13f2fee473bd45fd22a8288570a49abba2a9f81f/rules/autogenerated-key-metrics-web.yml#L7
bob:
> I wonder if we should be saving the full CRD object like I did with thanos-staging so it’s less hidden in CI?
I was thinking the same yesterday, but don't we have prometheus running on VMs in some places still?
Nick Duff:
Yes good point :sob:
bob:
So we'd have to check in both versions, if we wanted to do that
Nick Duff:
We can also use a label to denote rules that should exist on any instance.
bob:
I like scoping them better, to be honest.
Nick Duff:
Ok yes that is preferable if doable
bob:
But if that was an easy thing to do, I'd have done it already
:sweat_smile:
bob:
I'll look into it briefly, hoping I know more now than I did last time I looked into this.
Nick Duff:
Haha :joy: well let me look at the serviceMonitors tomorrow and we can always at least split up the targets while still deploying all rules, which could buy us time to decide the right way to do that for now
bob:
But since in our ideal world we don't have these in-between rules at all, I'm not going to spend too much time on it.
Nick Duff:
Yea agreed. Let me know your findings anyway
bob:
Will do, I'll start by trying to write this up in an issue.
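For context, this is roughly the difference between the per-service rules that already carry a static `type` label and the intermediate `sli_aggregations:*` rules that don't. The rule names and expressions below are simplified illustrations, not copied from the runbooks:

```yaml
groups:
  - name: 'Example: intermediate SLI aggregations'
    rules:
      # Intermediate rule: no static label today, so it cannot easily be
      # scoped to a single service or Prometheus shard.
      - record: sli_aggregations:http_requests_total:rate_5m
        expr: sum by (environment, stage, type) (rate(http_requests_total[5m]))

  - name: 'Example: per-service key metrics'
    rules:
      # Service-level rule: already carries a static type label.
      - record: gitlab_component_ops:rate_5m
        labels:
          type: web
        expr: >
          sum by (environment, stage, component) (
            sli_aggregations:http_requests_total:rate_5m{type="web"}
          )
```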
Sharding Prometheus
This is probably beneficial regardless of moving all recording rules into Thanos (#2475 (closed)). Prometheus still needs to scrape the metrics from the services, and giving the instances that handle high-cardinality services less to do will buy us more headroom anyway. We also need that headroom if we want to roll out the updated recording rules gradually in production.
To be able to make use of this optimization:
- Make the service monitors advertise which services they are scraping, so Thanos doesn't need to reach out to all sidecars all the time.
- Partition the current rules that we deploy to Prometheus so each shard only evaluates the ones needed for the services it monitors. This could prove difficult for the intermediate recording rules, which are often reused by aggregations for different services.
- Label the recording rules' CRDs accordingly so they are only deployed where they are needed (see the sketch below).
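A minimal sketch of what this scoping could look like with the Prometheus Operator, assuming we add a per-shard label when generating the `PrometheusRule` objects. The label key `shard`, the `sidekiq` shard, and the metric names are hypothetical placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autogenerated-key-metrics-sidekiq
  labels:
    shard: sidekiq   # hypothetical scoping label, added during generation/CI
spec:
  groups:
    - name: 'SLI aggregations: sidekiq'
      rules:
        - record: sli_aggregations:sidekiq_jobs_completion_count:rate_5m
          expr: sum by (environment, stage, worker) (rate(sidekiq_jobs_completion_count[5m]))
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-sidekiq
spec:
  # This shard only evaluates rules and scrapes targets carrying its label.
  ruleSelector:
    matchLabels:
      shard: sidekiq
  serviceMonitorSelector:
    matchLabels:
      shard: sidekiq
```

The intermediate `sli_aggregations:*` rules are the awkward part here: they would need the same scoping label as the rules that consume them, even though they are reused across services.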
Use remote write to make the metrics available in Thanos staging
This would mean we remove the dependency of thanos-staging on the sidecars in production.
This is where we want to end up for all our Thanos deployments: everything going through remote-write, with the Prometheus instances only responsible for scraping their targets. This could be a long way off, but we could do it for thanos-staging first so we remove this coupling and can continue experimenting.
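As a rough sketch, each Prometheus instance feeding thanos-staging would gain a `remote_write` section pointing at a Thanos Receive endpoint; the URL below is a placeholder, not an existing service:

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: https://thanos-receive.thanos-staging.example.internal/api/v1/receive  # placeholder endpoint
    queue_config:
      max_samples_per_send: 5000
      capacity: 20000
```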
This does not solve the problem of adding enough headroom to roll out our changes to production, which will still be using the thanos-sidecar approach.