SLO Monitoring: generate an alerting rule for each SLI (!2972) · Merge requests · GitLab.com / Runbooks

cc @cmiskell from our discussion on this 😄

This change replaces the generic component_error_ratio_burn_rate_slo_out_of_bounds_upper and component_apdex_ratio_burn_rate_slo_out_of_bounds_lower alerts, which broadly translate to "An SLI is outside of its SLO", with a specific alert per SLI.

This approach has many advantages:

When appropriate, the alert now contains stage group and feature_category information
We link directly to the service overview dashboard from the alert, instead of linking to the generic SLO violation dashboard
The multiwindow, multi-burn-rate alerting expressions can now be hardcoded with the SLO values, making the expressions easier to understand (and faster to execute)
Each alert now contains details of the SLI being monitored
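
To illustrate the point about hardcoded expressions, here is a minimal Python sketch of how a generator might inline the SLO thresholds into a multiwindow, multi-burn-rate expression. It is not the code from this MR. The burn-rate factors and window pairs (14.4 over 1h/5m, 6 over 6h/30m), the recording rule names `gitlab_component_errors:ratio_<window>`, the `type` and `component` labels, and the function name `error_ratio_expression` are all assumptions made for the example.

```python
# Burn-rate factors with (long window, short window) pairs. These are the
# values commonly used for multiwindow, multi-burn-rate alerting; they are
# assumed here, since the MR description does not list them.
BURN_RATES = [
    (14.4, "1h", "5m"),
    (6.0, "6h", "30m"),
]


def error_ratio_expression(service: str, component: str, slo: float) -> str:
    """Build an error-ratio alert expression with the SLO thresholds inlined.

    `slo` is the availability target, e.g. 0.999 for a 99.9% SLO.
    The metric and label names are hypothetical.
    """
    selector = f'{{type="{service}", component="{component}"}}'
    clauses = []
    for factor, long_window, short_window in BURN_RATES:
        # The threshold is a plain number: burn rate times the error budget.
        threshold = round(factor * (1 - slo), 6)
        clauses.append(
            "(\n"
            f"  gitlab_component_errors:ratio_{long_window}{selector} > {threshold}\n"
            "  and\n"
            f"  gitlab_component_errors:ratio_{short_window}{selector} > {threshold}\n"
            ")"
        )
    # Either burn-rate condition is enough to fire the alert.
    return "\nor\n".join(clauses)


if __name__ == "__main__":
    print(error_ratio_expression("web", "puma", 0.999))
```

Because each threshold is a literal number in the generated expression, nothing has to be joined against a separate SLO series at evaluation time, which is where the readability and execution-speed gains come from.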

The one downside of this approach is that we have more alerts, but since we're generating them, this should be fine.

Future Improvements

Going forward, this approach will open up many new possibilities, for instance:

Team-specific routing on SLO violation alerts
SLO-specific runbooks for specific conditions, rather than generic runbooks
Simpler approach to overriding SLOs on a per-SLI basis, à la https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11923
Each alert will have a recognisable name, instead of the generic component_error_ratio_burn_rate_slo_out_of_bounds_upper and component_apdex_ratio_burn_rate_slo_out_of_bounds_lower names.
1. The new names will be something like WebServicePumaApdexSLOViolation or APIServiceWorkhorseErrorSLOViolation, which are easier to parse.
2. This change is trivial, but I'd prefer to leave it for a future MR as we need to carefully migrate silences, so for now, I'll leave these names as-is.
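
As a rough sketch of how such names could be derived, the Python snippet below builds an alert name from a service, a component, and an SLI kind. The helper names, the acronym list, and the assumption that inputs are lower-case identifiers like `web`, `puma`, and `apdex` are all hypothetical; the eventual naming scheme may differ.

```python
# Name parts rendered fully upper-case instead of capitalized.
# This list is hypothetical and only covers the example in the description.
ACRONYMS = {"api"}


def _camel(identifier: str) -> str:
    """Turn a lower-case identifier such as 'web' or 'api' into a name part."""
    parts = identifier.replace("-", "_").split("_")
    return "".join(p.upper() if p in ACRONYMS else p.capitalize() for p in parts)


def alert_name(service: str, component: str, sli_kind: str) -> str:
    """Compose an alert name from service, component, and SLI kind."""
    return f"{_camel(service)}Service{_camel(component)}{_camel(sli_kind)}SLOViolation"


if __name__ == "__main__":
    print(alert_name("web", "puma", "apdex"))
    print(alert_name("api", "workhorse", "error"))
```

With these inputs the function yields the two example names given above.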

Edited Nov 23, 2020 by Andrew Newdigate
