Stop paging on HPA saturation
Overview
At the moment we page the SRE on-call when a service starts saturating its Horizontal Pod Autoscaler (HPA). Saturation can have several causes: legitimate service growth, a large-scale attack, or a fault in the application.
This alert has paged 30 times in the last 3 months. Sometimes it is not actionable; other times, as noted above, it is only a symptom of a problem we are already working on because another SLO alert fired.
Proposal
We should stop paging on HPA saturation. Even when the HPA is saturated, the application can still behave and work as expected, and our users don't care whether the HPA is saturated.
This follows the same reasoning as not alerting on high CPU usage on a machine: if there is no customer impact, we shouldn't page on it. We can either route this alert to a Slack channel or make it part of our capacity planning workflow.
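The routing rule proposed above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the actual alerting configuration: the `ServiceState` type, the `route_alert` function, and the `"page"` / `"slack"` / `"none"` destinations are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceState:
    """Hypothetical snapshot of a service, used only for this sketch."""
    hpa_saturated: bool  # the HPA cannot scale the service any further
    slo_violated: bool   # a user-facing SLO alert is currently firing


def route_alert(state: ServiceState) -> str:
    """Decide where a notification should go under the proposal.

    Only customer impact (an SLO violation) pages the on-call.
    HPA saturation on its own is a capacity signal, so it goes to Slack.
    """
    if state.slo_violated:
        return "page"
    if state.hpa_saturated:
        return "slack"
    return "none"
```

Under this sketch, a saturated HPA with healthy SLOs yields `"slack"`, while a saturated HPA during an SLO violation still yields `"page"`, but the page is driven by the SLO and not by the saturation.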
Outcome
- Don't page on HPA saturation 👉 gitlab-com/runbooks!4741 (merged)
- Include api/git/web in tamland forecasting 👉