Skip to content

Push error ratio SLA for pages up to 4 nines

Corrective action for gitlab-com/gl-infra/production#2379 (closed)

With the existing 99.95% SLA, it took about 10 minutes for us to use our 1h error budget during the regression introduced in todays deployment:

image

We can tune the error budget for this service to 99.99%:"

With a 99.99% error rate SLA, no alerts over the past two weeks:

image

With the 99.99% error budget, the alert would have fired 3 minutes after the errors started occurring, at 14h22:

image

Error burn rate dashboard link

https://dashboards.gitlab.net/d/alerts-component_multiburn_error/alerts-mwmbr-error-rate-sla-monitoring?orgId=1&from=1594131004508&to=1594131865914&var-PROMETHEUS_DS=Global&var-environment=gprd&var-type=web-pages&var-stage=main&var-component=loadbalancer&var-proposed_slo=0.9999

Edited by Andrew Newdigate

Merge request reports