Use a multi-window, multi-burn-rate (MWMBR) approach for MTBF measurement.
In a first iteration, MTBF was using the slo_observation status to check the status of a service. This meant that we'd record a failure every time an SLI dropped below a relatively loose threshold (0.95 in the case of web).
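A minimal sketch of that first iteration, assuming hypothetical names (`failures`, `sli_observations`) that are not from the actual implementation: every observation where the SLI dips below the loose monitoring threshold counts as a failure.

```python
# Illustrative only: the first MTBF iteration recorded a failure on
# every dip of the SLI below the (loose) monitoring threshold.
WEB_THRESHOLD = 0.95  # the loose threshold mentioned above, for web

def failures(sli_observations, threshold=WEB_THRESHOLD):
    """Return timestamps where the SLI dropped below the threshold."""
    return [ts for ts, sli in sli_observations if sli < threshold]

# Two short dips produce two recorded "failures":
observations = [(0, 0.99), (60, 0.94), (120, 0.97), (180, 0.93)]
print(failures(observations))  # [60, 180]
```

With a threshold this loose, every brief dip registers as a failure, which is exactly the behavior mstaff#17 concluded we should move away from.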
We concluded in mstaff#17 (closed) that the path forward is not to alert on every single dip below a burn-rate threshold, but instead to allow specifying multiple thresholds.
That would allow us to tighten the "failure" threshold while leaving the monitoring threshold in place.
Looking at an issue a human saw, on the mwmbr dashboard in Grafana with the monitoring threshold at the time (0.999):
no burn rates dropped below the thresholds, and no alerts fired, even though the service had been choppy.
Raising the threshold to 0.9995:
This would have surfaced the errors. We can't raise the monitoring threshold, because its alerts would become noise. But for MTBF, I think we could set the thresholds a bit tighter, illustrating where we want to be. MTBF would then become the mean time between alerts, given thresholds set where we want them to be.
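The "mean time between alerts" idea above can be sketched as follows. This is illustrative, not the dashboard's actual query: `mtbf` and the sample burn rates are made up, and the 0.999 vs 0.9995 thresholds are the values discussed above.

```python
# Illustrative: with tighter MTBF thresholds, MTBF is simply the mean
# gap between consecutive threshold violations ("alerts").
def mtbf(alert_timestamps):
    """Mean time between consecutive alerts (same unit as the timestamps)."""
    if len(alert_timestamps) < 2:
        return None  # need at least two alerts to measure a gap
    gaps = [b - a for a, b in zip(alert_timestamps, alert_timestamps[1:])]
    return sum(gaps) / len(gaps)

# Hypothetical burn rates that stay above the monitoring threshold (0.999)
# but dip below the tighter MTBF threshold (0.9995):
rates = [(0, 0.9998), (3600, 0.9992), (7200, 0.9997), (10800, 0.9993)]
loose = [ts for ts, r in rates if r < 0.999]   # monitoring: nothing fires
tight = [ts for ts, r in rates if r < 0.9995]  # MTBF: two violations
print(mtbf(loose), mtbf(tight))  # None 7200.0
```

At the monitoring threshold the choppy service produces no signal at all; at the tighter threshold the same data yields a measurable MTBF.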
We already have separate thresholds in use by team Delivery. They're currently set to the same values as the monitoring thresholds. In gitlab-com/runbooks!3219 (merged) I'm working on a way to easily add multiple thresholds, which could then potentially all be used on the MTBF dashboard.
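As a sketch of what carrying multiple named thresholds per service could look like (this is not the runbooks implementation; the `THRESHOLDS` table, names, and values are all hypothetical):

```python
# Hypothetical: each service carries several named thresholds, so
# monitoring and MTBF can diverge without duplicating the SLI definition.
THRESHOLDS = {
    "web": {"monitoring": 0.999, "mtbf": 0.9995},  # illustrative values
}

def breached(service, kind, burn_rate):
    """True if the burn rate falls below the named threshold."""
    return burn_rate < THRESHOLDS[service][kind]

# The same burn rate stays quiet for monitoring but counts for MTBF:
print(breached("web", "monitoring", 0.9992))  # False
print(breached("web", "mtbf", 0.9992))        # True
```

The point of keeping both under one definition is that the MTBF dashboard and the alerting rules read the same catalog, only with different threshold names.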
Update
Updated the apdex threshold to 0.998, and lowered the error-ratio threshold to 0.999.