refactor(alerts): lower urgent-cpu-bound execution SLO

Steve Xuereb requested to merge fix/urgent-cpu-execution-slo into master

What

Lower the urgent-cpu-bound execution SLO from 0.995 to 0.99

Why

In https://nonprod-log.gitlab.net/app/r/s/gC1II we see this alert firing a few times, and in https://dashboards.gitlab.net/d/sidekiq-main/sidekiq3a-overview?from=now-7d&to=now&var-environment=gprd&var-stage=main&var-shard=urgent-cpu-bound&orgId=1 we also see the apdex drop every hour. This is a known issue that we are fixing in gitlab-org/gitlab#430782 (closed), but in the meantime it is causing too many false alerts for the on-call.

Lowering the SLO to 0.99 makes the alert less sensitive, so it stops paging the on-call multiple times a day until the infradev issue is fixed.
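To make the effect of the change concrete, here is a rough sketch of the arithmetic in Python. It is not the alerting logic in this repository: the function names, the job count, and the 0.992 apdex dip are hypothetical values chosen for illustration, and the check is a plain threshold comparison rather than the real alert expression.

```python
OLD_SLO = 0.995
NEW_SLO = 0.99


def allowed_failures(slo: float, total_jobs: int) -> int:
    """Number of jobs that may miss the execution threshold within the SLO."""
    # Rounded to a whole job count to avoid floating-point noise.
    return round((1 - slo) * total_jobs)


def below_slo(apdex: float, slo: float) -> bool:
    """Simplified check: is the observed apdex below the SLO?"""
    return apdex < slo


# Hypothetical numbers, not taken from the dashboard.
jobs = 100_000
hourly_dip = 0.992

print("old budget:", allowed_failures(OLD_SLO, jobs), "slow jobs")
print("new budget:", allowed_failures(NEW_SLO, jobs), "slow jobs")
print("dip breaches old SLO:", below_slo(hourly_dip, OLD_SLO))
print("dip breaches new SLO:", below_slo(hourly_dip, NEW_SLO))
```

Under these assumptions the error budget doubles from 0.5% to 1% of jobs, so a dip that lands between the two thresholds no longer counts as an SLO violation.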

Before: Screenshot_2023-11-21_at_15.08.47 (source)
After: Screenshot_2023-11-21_at_15.07.34 (source)

Reference: gitlab-com/gl-infra/production#17162 (closed)

Edited by Steve Xuereb
