Skip to content

fix(kube): relax cluster_scaleup SLO

Steve Xuereb requested to merge fix/cluster_scaleup into master

What

Reduce the cluster_scaleup SLO to be 0.9 instead of 0.9995

Why

In https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6938 we got paged because 1 specific cluster failed because an instance timed out, this is not very actionable.

After looking at the SLI in gitlab-com/gl-infra/production#6943 (comment 930107773) we see that it's a very low "traffic" service, where 1 error burns through both the 1h window and 6h window.

Failing to scale up 1 node pool once is not a big problem since we have retry mechanisims and different zone as well, however if we start having a large failure we still should get alerted.

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15665

Edited by Steve Xuereb

Merge request reports