SLO Monitoring: generate an alerting rule for each SLI
cc @cmiskell from our discussion on this
Part of gitlab-com/gl-infra&358
This change replaces the generic component_error_ratio_burn_rate_slo_out_of_bounds_upper and component_apdex_ratio_burn_rate_slo_out_of_bounds_lower alerts, which broadly translate to "An SLI is outside of SLO", with a specific alert per SLI.
This approach has many advantages:
- When appropriate, the alert now contains stage group and feature_category information
- We link directly to the service overview dashboard from the alert, instead of linking to the generic SLO violation dashboard
- The multiwindow, multi-burn-rate alerting expressions can now be hardcoded with the SLO values, making the expressions easier to understand (and faster to execute)
- Each alert now contains details of the SLI being monitored
Obviously, the one downside of this approach is that we have more alerts, but since we're generating them, this should be fine.
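To illustrate the point about hardcoding SLO values, here is a rough Python sketch of how one error-ratio alert per SLI could be generated. It is not the generator used in this MR. The `SLI` class, the `error_ratio_<window>` metric names, the `type`/`component` selector labels and the 14.4/6 burn-rate factors are all assumptions made for the example.

```python
# Sketch only: metric names, label names and burn-rate factors are assumptions.
from dataclasses import dataclass

# (long window, short window, burn-rate factor): the pairs commonly used
# for multiwindow, multi-burn-rate alerting.
BURN_RATES = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


@dataclass(frozen=True)
class SLI:
    service: str                # e.g. "web"
    component: str              # e.g. "puma"
    slo: float                  # target success ratio, e.g. 0.999
    feature_category: str = ""  # empty when the SLI has no single owner


def error_burn_rate_expr(sli: SLI) -> str:
    """Build a PromQL expression with the SLO thresholds inlined as literals."""
    selector = f'{{type="{sli.service}", component="{sli.component}"}}'
    clauses = []
    for long_w, short_w, factor in BURN_RATES:
        # Threshold = burn-rate factor * error budget, computed at generation
        # time so the resulting expression needs no lookup of the SLO.
        threshold = factor * (1 - sli.slo)
        clauses.append(
            f"(error_ratio_{long_w}{selector} > {threshold:g}"
            f" and error_ratio_{short_w}{selector} > {threshold:g})"
        )
    return " or ".join(clauses)


def error_alert_rule(sli: SLI) -> dict:
    """Return one Prometheus-style alerting rule for a single SLI."""
    labels = {"type": sli.service, "component": sli.component}
    if sli.feature_category:
        # Only attach ownership information when it is appropriate.
        labels["feature_category"] = sli.feature_category
    return {
        # The existing generic alert name is kept, as described in this MR.
        "alert": "component_error_ratio_burn_rate_slo_out_of_bounds_upper",
        "expr": error_burn_rate_expr(sli),
        "labels": labels,
        "annotations": {
            "title": f"The {sli.component} SLI of the {sli.service} "
                     "service has an error rate violating SLO",
        },
    }


rules = [
    error_alert_rule(SLI("web", "puma", 0.999, "source_code_management")),
    error_alert_rule(SLI("api", "workhorse", 0.999)),
]
```

Because each threshold is a literal in the expression, a reader of the generated rule can see the exact value being compared against.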
Future Improvements
Going forward, this approach will open up many new possibilities, for instance:
- Team-specific routing on SLO violation alerts
- SLO-specific runbooks for specific conditions, rather than generic runbooks
- Simpler approach to overriding SLOs on a per-SLI basis, à la https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11923
- Each alert will have a recognisable name, instead of the generic component_error_ratio_burn_rate_slo_out_of_bounds_upper and component_apdex_ratio_burn_rate_slo_out_of_bounds_lower names.
  - The new names will be something like WebServicePumaApdexSLOViolation or APIServiceWorkhorseErrorSLOViolation, which are easier to parse.
  - This change is trivial, but I'd prefer to leave it for a future MR as we need to carefully migrate silences, so for now, I'll leave these names as-is.
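As a rough illustration of the proposed naming scheme, the Python sketch below derives names such as WebServicePumaApdexSLOViolation from a service, a component and an SLI kind. The `alert_name` helper and the `ACRONYMS` set are hypothetical. The final naming rules are left to the future MR mentioned above.

```python
# Sketch only: the helper name and the acronym list are assumptions.
ACRONYMS = {"api"}  # parts rendered fully upper-case rather than capitalised


def _camel(part: str) -> str:
    """Convert 'some_name' or 'some-name' to 'SomeName', upper-casing acronyms."""
    words = part.replace("-", "_").split("_")
    return "".join(
        w.upper() if w in ACRONYMS else w.capitalize() for w in words if w
    )


def alert_name(service: str, component: str, kind: str) -> str:
    """Build a per-SLI alert name, e.g. WebServicePumaApdexSLOViolation."""
    if kind not in ("apdex", "error"):
        raise ValueError(f"unknown SLI kind: {kind!r}")
    return f"{_camel(service)}Service{_camel(component)}{_camel(kind)}SLOViolation"


print(alert_name("web", "puma", "apdex"))
print(alert_name("api", "workhorse", "error"))
```

With these inputs the helper reproduces the two example names given in the list above.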
Edited by Andrew Newdigate