
feat: stop paging on hpa saturation

Steve Xuereb requested to merge feat/stop-hpa-alerts into master

What

Stop paging the on-call engineer on HPA saturation.

Why

This alert has paged 30 times in the last 3 months. Sometimes it's not actionable or, as stated earlier, it's just a symptom of a problem we are already working on because another SLO alert has fired.

Following the methodology from My Philosophy on Alerting, we shouldn't page on cause-based alerts, especially if there is no user impact. This is similar to how we no longer alert on high CPU usage when the requests we serve to our users are within the specified SLO.

We can add capacity planning in tamland if we want to get forecasting for HPA saturation.

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15883

Edited by Steve Xuereb
