
feat: allow SLIs to use confidence intervals in alerting instead of absolute ratios

Andrew Newdigate requested to merge add-confidence-interval-alerting into master

Part of #153 and https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/4774

👈 Requires !7336 (merged)


This builds on !7336 (merged), allowing confidence intervals to be configured for use in SLI alerting.

This should help reduce alert noise and false positives on SLIs during low-RPS periods.

How does it work?

In !7336 (merged), we added new recording rules for calculating confidence intervals for our SLIs, but these recording rules are not yet used anywhere.

This change allows individual SLIs to be configured to alert on those confidence interval recording rules instead of the plain ratio recording rules.

This can be done by adding the following to the SLI definition:

        useConfidenceLevelForSLIAlerts: '98%',

This has been added for two SLIs:

  1. In the Reference Architecture, the workhorse SLI on the webservice service. This is a noisy endpoint which flaps during low-RPS periods.
  2. For GitLab (Mimir evaluation only): the rails_requests SLI on the websockets service. This is a very low-RPS endpoint, which makes it useful for testing.

At a later stage, it might make sense for this behaviour to become the default, but for now, we'll test on a small number of SLIs to get an idea of its performance.

With this change in place, the alert is switched from using the SLI recording rule, such as:

        gitlab_component_errors:ratio_30m{component="workhorse",type="webservice"}

to the confidence interval equivalent:

        gitlab_component_errors:confidence:ratio_30m{component="workhorse",confidence="98%",type="webservice"}

A confidence label is also added to the alert, so that we can later analyse whether this change is effective.
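The switch described above can be sketched roughly as follows. This is only an illustration in Python, not the actual generator code in this MR: the function name `alert_selector` and the dict-based SLI definition are hypothetical, and only the metric names, labels and the `useConfidenceLevelForSLIAlerts` key are taken from the description.

```python
def alert_selector(sli: dict, component: str, service_type: str, window: str = "30m") -> str:
    """Build the error-ratio selector an alert would use for a given SLI (hypothetical helper)."""
    labels = {"component": component, "type": service_type}
    level = sli.get("useConfidenceLevelForSLIAlerts")
    if level is None:
        # Default behaviour: alert on the plain SLI ratio recording rule.
        metric = f"gitlab_component_errors:ratio_{window}"
    else:
        # Opted in: alert on the confidence interval recording rule instead,
        # selecting the series for the configured confidence level.
        metric = f"gitlab_component_errors:confidence:ratio_{window}"
        labels["confidence"] = level
    body = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{body}}}"


print(alert_selector({}, "workhorse", "webservice"))
print(alert_selector({"useConfidenceLevelForSLIAlerts": "98%"}, "workhorse", "webservice"))
```

An SLI without the setting keeps its existing selector, so opting in is a per-SLI change.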

Edited by Andrew Newdigate
