feat: allow SLIs to use confidence intervals in alerting instead of absolute ratios
Part of #153 and https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/4774
This builds on !7336 (merged), allowing confidence intervals to be configured for use in SLI alerting.
This should help reduce noise and false-positive alerts during low-RPS periods.
How does it work?
In !7336 (merged), we added new recording rules for calculating confidence intervals for our SLIs, but these recording rules are not yet used anywhere.
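For intuition on why a confidence interval is less noisy than a raw ratio at low RPS, here is a minimal Python sketch. It assumes a Wilson score interval and assumes the alert compares the lower bound of the error ratio against the threshold; neither detail is stated in this description, and the actual recording rules from !7336 may compute the interval differently. The function name `wilson_interval` and the sample request counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(errors: int, total: int, confidence: float = 0.98) -> tuple[float, float]:
    """Return (lower, upper) bounds of a two-sided Wilson score interval
    for the error ratio errors/total. Assumed method, for illustration only."""
    if total == 0:
        # No requests observed: the ratio is unconstrained.
        return (0.0, 1.0)

    # z-score for a two-sided interval, e.g. about 2.33 for 98%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    margin = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - margin), min(1.0, centre + margin))


# Same raw error ratio (20%), very different sample sizes.
low_rps_lower, _ = wilson_interval(errors=2, total=10)
high_rps_lower, _ = wilson_interval(errors=2000, total=10000)
print(f"low RPS lower bound:  {low_rps_lower:.3f}")
print(f"high RPS lower bound: {high_rps_lower:.3f}")
```

With only 10 requests, the lower bound sits well below the raw 20% ratio, so a couple of failed requests are less likely to cross an alerting threshold. With 10,000 requests the lower bound stays close to 20%, so a genuine problem would still alert.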
This change allows individual SLIs to be configured to use those confidence-interval recording rules for alerting.
This can be done by adding the following to the SLI definition:
useConfidenceLevelForSLIAlerts: '98%',
This has been added for two SLIs:
- In the Reference Architecture, the workhorse SLI on the webservice service. This is a noisy endpoint which flaps during low-RPS periods.
- For GitLab (mimir evaluation only): the rails_requests SLI on the websockets service. This is a very low RPS endpoint, so useful for testing.
At a later stage, it might make sense for this behaviour to become the default, but for now, we'll test on a small number of SLIs to get an idea of its performance.
With this change in place, the alert is switched from using the SLI recording rule, such as:
gitlab_component_errors:ratio_30m{component="workhorse",type="webservice"}
to the confidence interval equivalent:
gitlab_component_errors:confidence:ratio_30m{component="workhorse",confidence="98%",type="webservice"}
An additional confidence label is also added to the alert, so that we can later analyse whether this change is effective.
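The switch between the two recording rules can be sketched as follows. This is an illustration in Python rather than the actual implementation: the helper `alert_selector` and its parameters are hypothetical, and only the `useConfidenceLevelForSLIAlerts` field, the metric names, and the label names come from this description.

```python
def alert_selector(sli: dict, component: str, service_type: str, window: str = "30m") -> str:
    """Build the metric selector an alert would use for an SLI.

    Hypothetical helper: falls back to the plain ratio rule unless the SLI
    sets useConfidenceLevelForSLIAlerts.
    """
    labels = {"component": component, "type": service_type}
    metric = f"gitlab_component_errors:ratio_{window}"

    confidence = sli.get("useConfidenceLevelForSLIAlerts")
    if confidence is not None:
        # Use the confidence-interval recording rule and carry the
        # confidence level as an extra label.
        metric = f"gitlab_component_errors:confidence:ratio_{window}"
        labels["confidence"] = confidence

    # Render labels in sorted order for a stable selector string.
    rendered = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{rendered}}}"


print(alert_selector({}, "workhorse", "webservice"))
print(alert_selector({"useConfidenceLevelForSLIAlerts": "98%"}, "workhorse", "webservice"))
```

An SLI without the setting keeps its existing selector, so the change is opt-in per SLI.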