Improve scaling for Prometheus agent
Currently our Prometheus instances are deployed as a single (HA) monolithic deployment per Kubernetes cluster.
To date the only way we have been able to deal with scaling requirements is to scale vertically by assigning more resources.
However, this is not a good approach and leaves us susceptible to several issues:
- Target Discovery can struggle to keep up as cluster workloads increase during peak periods.
- Inconsistent scrape timings can cause jitter.
- No ability to autoscale as workloads change (which also affects cost optimisation).
- A crash or restart can take an extended period of time to replay the WAL, leaving us running with no redundancy for long stretches.
- Large over-provisioning is generally required to deal with the above, but it suffers from rapidly diminishing returns as the workload grows.
We should aim to improve our resilience and fault tolerance here, as well as split the effective workload across many pods instead of a single monolithic one.
There are several options in Prometheus for sharding targets across multiple pods; however, they all have drawbacks and limitations, such as no target resharding during downscaling operations, which makes autoscaling a challenge. Agent mode can also be configured as a DaemonSet with target scraping limited to node-local pods/containers, which is in fact how Google deploys its own managed Prometheus.
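To illustrate why downscaling is awkward, the usual sharding approach relies on hashmod relabelling, roughly as sketched below. The job name, modulus of 3 and shard index of 0 are placeholder values for illustration, not our actual configuration:

```yaml
# Static hashmod sharding: each replica keeps only its own slice of targets.
scrape_configs:
  - job_name: kubernetes-pods        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash every discovered target address into one of 3 buckets.
      - source_labels: [__address__]
        modulus: 3                   # total shard count, baked into the config
        target_label: __tmp_hash
        action: hashmod
      # Keep only the bucket owned by this replica (shard 0 in this example).
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep
```

Because the shard count and shard index are baked into each replica's configuration, changing the replica count means re-rendering and reloading every config, and existing targets are not rebalanced automatically.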
For this, we should at least be testing two options:
- Grafana Alloy as a drop-in replacement for Prometheus Agent.
- Prometheus Agent as a DaemonSet (sketched below).
Both carry pros and cons, but they are currently the two best options for breaking up the Prometheus workload in our large production clusters.
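For the DaemonSet option, a minimal sketch of a node-local scrape config might look like the following. This assumes the node name is injected at deploy time (e.g. via the downward API plus config templating, since Prometheus does not expand environment variables inside scrape configs); the job name and `${NODE_NAME}` placeholder are illustrative only:

```yaml
# Node-local scraping for an agent running as a DaemonSet.
scrape_configs:
  - job_name: node-local-pods        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
        # Field selector so the API server only returns pods scheduled on
        # this node, instead of every pod in the cluster.
        selectors:
          - role: pod
            field: spec.nodeName=${NODE_NAME}   # placeholder; substituted at deploy time
```

Each agent pod would then remote_write its samples to our central store, and per-node discovery keeps the discovery load proportional to the pods on that node rather than the whole cluster. Grafana Alloy, by contrast, advertises a clustering mode that rebalances targets as instances join and leave, which is worth verifying as part of the autoscaling evaluation.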