Rethink capacity planning for resources managed by KEDA: how do we measure saturation in such a case?
Spawned from Slack thread: https://gitlab.slack.com/archives/C04MH2L07JS/p1701101981838699
TLDR
KEDA dynamically scales resources up and down based on a set of conditions such as queue size, historical scaling behavior, and CPU utilization. Our current kube_horizontalpodautoscaler_desired_replicas
PromQL query calculates the ratio between desired and max replicas; however, if both desired and max replicas increase and decrease dynamically over time, this ratio tends towards a flat line. Given that resources are being scaled up and down correctly, the services will always appear to operate at capacity, since resources are adjusted as needed to accommodate the load.
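For context, the saturation point boils down to this desired-to-max ratio per HPA. A minimal sketch of that ratio, assuming the standard kube-state-metrics series names (the actual recording rules in use may differ, and the namespace filter is purely illustrative):

```promql
# Fraction of the HPA's maximum replica count currently requested.
# Series names assume kube-state-metrics; the namespace label is an example.
kube_horizontalpodautoscaler_status_desired_replicas{namespace="sidekiq"}
/
kube_horizontalpodautoscaler_spec_max_replicas{namespace="sidekiq"}
```

When desired and max replicas move together over time, this ratio hovers around a roughly constant value, which is why the forecast sees a flat line even though the underlying workload is spiky.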
Parts of the discussion extracted from the Slack thread
@stejacks-gitlab Hey folks, I've tagged a number of people on issues, but we presently have 9 capacity planning issues open related to Sidekiq -- https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/boards/2816983?label_name[]=Service%3A%3ASidekiq -- it'd be good for someone to take a look before we end up with even more Sidekiq-incident-related issues.
@cmcfarland This is a hunch, and I need to run down the specifics, but many of our Sidekiq shards now scale based on queue size and historical scaling (from the past week) in addition to the normal CPU utilization scaling. This has created extremely spiky HPA scaling. I was really just wondering if anyone has looked into this yet. Example graph:
@abrandl If the autoscaling works perfectly, saturation becomes a flat line, doesn't it?
@hmerscher I guess if the dynamic scaling is taking time to react to load, that would mean it stays saturated for some time, then it goes down, and the cycle repeats eventually.
This is not bad per se (unless the services are degraded), but it means the service could be operating at 80-100% of capacity most of the time, which would explain why the saturation forecast is always hitting the ceiling.
@cmcfarland It actually scales up in anticipation of work, then stays high due to CPU use, then drops off from lack of CPU demand. The big change is that we don't scale up slowly based on CPU measurements anymore. It's also driven by other metrics, which, as I understand it, results in much larger swings in the replica count.
Does Tamland not allow for a PromQL query that takes an average over an hour, to help better predict growth over time versus spikes in use?
@hmerscher It does, and here is the PromQL for kube_horizontalpodautoscaler_desired_replicas
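The query itself isn't reproduced in the excerpt above. Purely as an illustration of the smoothing idea, the desired-to-max ratio could be averaged over an hour with a subquery; the series names and the 1h window here are assumptions, not what Tamland actually uses:

```promql
# Hypothetical smoothing of the desired/max ratio over the last hour.
# avg_over_time plus a subquery dampens short-lived spikes in the ratio.
avg_over_time(
  (
    kube_horizontalpodautoscaler_status_desired_replicas
    /
    kube_horizontalpodautoscaler_spec_max_replicas
  )[1h:]
)
```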
@abrandl We can switch it to use quantile95_1w for those saturation points; by default we only use p95/1h: https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/blob/e7705a91768e2de736b3909023939871f6a23aca/manifest.json#L15
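To make the difference between the two aggregation windows concrete, a p95 taken over a week (quantile95_1w) rather than over an hour would look roughly like the sketch below; the expression being aggregated is an assumption here, not what the manifest configures:

```promql
# p95 of the desired/max ratio over a one-week window (quantile95_1w).
# The default p95/1h would be the same expression with a [1h:] subquery range.
quantile_over_time(0.95,
  (
    kube_horizontalpodautoscaler_status_desired_replicas
    /
    kube_horizontalpodautoscaler_spec_max_replicas
  )[1w:]
)
```

A weekly p95 reflects sustained demand rather than short bursts, which may be closer to the "growth over time versus spikes in use" distinction raised above.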