fix(kube): relax cluster_scaleup SLO
What
Reduce the cluster_scaleup
SLO to be 0.9
instead of 0.9995
Why
In https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6938 we got paged because 1 specific cluster failed because an instance timed out, this is not very actionable.
After looking at the SLI in gitlab-com/gl-infra/production#6943 (comment 930107773) we see that it's a very low "traffic" service, where 1 error burns through both the 1h window and 6h window.
Failing to scale up 1 node pool once is not a big problem since we have retry mechanisims and different zone as well, however if we start having a large failure we still should get alerted.
Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15665
Edited by Steve Xuereb