Violator: allow SLO violations to be silenced through temporary SLO threshold adjustments

This idea came out of a discussion with @cmiskell.

At present, each service has a single, fixed SLO threshold.

Sometimes a single SLI will violate the service SLO, or a stage will violate the SLO, due to a known issue. Examples of this include scalability#619 and https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/470.

The current process is:

  1. Create an issue about the known violation.
  2. Create a silence in AlertManager.
  3. Helicopter will then remind subscribers on the issue of the silence through periodic notifications, e.g. scalability#619 (comment 447737002).

This approach works reasonably well, but leaves a lot of room for improvement, particularly since these silences are binary. For example, when we silence latency SLI alerts for Gitaly Canary, we will not receive alerts, no matter how much worse things get.

Proposal

Allow an SLO threshold to be temporarily overridden with an adjusted value, up until a fixed expiry date.

For example, for Gitaly Canary, instead of silencing all alerts, we apply an override, temporarily lowering the SLO for this stage of the service from 99.95% to 99.5%.

If the service level drops below the new, lower threshold, alerts will still fire. This is an improvement on the current approach, in which all alerts are silenced.

Implementation

(Suggested MVC, alternative proposals welcomed)

Overrides for known violations are maintained in a YAML file, with a format along these lines:

- type: gitaly
  env: gprd  # production
  stage: cny # canary stage
  sli: server 
  apdex_slo: 99.5%
  expiry_date: 2020-11-26
  issue: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/619

A CI job runs periodically and exports all non-expired overrides to Prometheus, via the push-gateway.
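As a rough sketch of what such a job might do, the Python below filters out expired overrides and renders the rest in the Prometheus text exposition format. It is not a worked-out implementation. The helper names (`percent_to_ratio`, `active_overrides`, `render_metrics`) are hypothetical, the override list is hard-coded rather than parsed from YAML, the expiry date is assumed to be inclusive, and the actual push to the push-gateway is left out.

```python
from datetime import date

# Hypothetical in-memory form of the override file; a real job would parse the YAML.
OVERRIDES = [
    {
        "type": "gitaly",
        "env": "gprd",
        "stage": "cny",
        "sli": "server",
        "apdex_slo": "99.5%",
        "expiry_date": date(2020, 11, 26),
        "issue": "https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/619",
    },
]


def percent_to_ratio(text):
    """Convert a percentage string such as '99.5%' to a ratio such as 0.995."""
    return round(float(text.rstrip("%")) / 100, 6)


def active_overrides(overrides, today):
    """Keep overrides that have not expired (expiry day inclusive: an assumption)."""
    return [o for o in overrides if o["expiry_date"] >= today]


def render_metrics(overrides, today):
    """Render non-expired overrides as Prometheus text exposition lines."""
    lines = ["# TYPE gitlab_apdex_slo_override gauge"]
    for o in active_overrides(overrides, today):
        # Label set mirrors the example series: the 'sli' field becomes 'component'.
        labels = f'type="{o["type"]}",stage="{o["stage"]}",component="{o["sli"]}"'
        value = percent_to_ratio(o["apdex_slo"])
        lines.append(f"gitlab_apdex_slo_override{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics(OVERRIDES, date(2020, 11, 1)), end="")
```

Once the expiry date has passed, the job simply stops emitting the series. Whether a previously pushed value then needs to be deleted from the push-gateway explicitly is something the real job would have to handle.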

The series could look something like:

gitlab_apdex_slo_override{type="gitaly", stage="cny", component="server"} 0.995 

These values could then be incorporated into alerting rules.
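To illustrate the intended semantics, here is a small Python sketch of how an alerting rule might choose its threshold: use the override when one is active, otherwise fall back to the service default. The function names are hypothetical and the single-value comparison is a simplification. Real alerting rules would be written in PromQL, where the `or` operator could play the fallback role, and are likely more involved.

```python
def effective_slo(default_slo, override_slo=None):
    """Return the override threshold when one is active, else the default."""
    return override_slo if override_slo is not None else default_slo


def violates_slo(apdex, default_slo, override_slo=None):
    """True when the measured apdex ratio is below the effective threshold."""
    return apdex < effective_slo(default_slo, override_slo)


if __name__ == "__main__":
    default, override = 0.9995, 0.995  # 99.95% default, 99.5% temporary override
    for apdex in (0.997, 0.990):
        print(apdex, violates_slo(apdex, default), violates_slo(apdex, default, override))
```

With the Gitaly Canary numbers from the proposal, an apdex of 0.997 violates the default 99.95% SLO but not the 99.5% override, while an apdex of 0.990 violates both. This is the behaviour a binary silence cannot provide.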

Helicopter would then include any SLO overrides in the notifications on the issues, to remind issue subscribers that the violation override is still in place.

What's with the name?

Violator is for dealing with SLO violations. Also, this:

"Enjoy the Silence" is a song by English electronic music band Depeche Mode. Recorded in 1989, it was released as the second single from their seventh studio album, Violator (1990), on 5 February 1990

https://en.wikipedia.org/wiki/Enjoy_the_Silence