fix(alerting): lower web apdex SLO from 99.9% to 99.8%
We've gotten used to tightening our SLOs as the availability and performance of GitLab.com have improved.
Unfortunately, this time, the trend is going the opposite way, and we will be loosening the SLO for web performance.
This is because we are seeing a great deal of slow-burn web PagerDuty alerts, such as the following:
- gitlab-com/gl-infra/production#3497 (closed)
- gitlab-com/gl-infra/production#3481 (closed)
- gitlab-com/gl-infra/production#3480 (closed)
- gitlab-com/gl-infra/production#3323 (closed)
- gitlab-com/gl-infra/production#3302 (closed)
- gitlab-com/gl-infra/production#3270 (closed)
- gitlab-com/gl-infra/production#3149
- gitlab-com/gl-infra/production#3092 (closed)
- gitlab-com/gl-infra/production#3053 (closed)
- gitlab-com/gl-infra/production#3051
- gitlab-com/gl-infra/production#2934
- gitlab-com/gl-infra/production#2931
In many cases, there is little we can do. Many endpoints in the application are slowing down and no longer rendering within the 1s limit required for the "satisfactory" apdex threshold.
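To make the mechanics concrete, here is a minimal sketch of the classic Apdex formula with the 1s satisfactory threshold mentioned above. The latencies and the exact weighting are illustrative; GitLab.com's production apdex is computed from latency histogram metrics, but the principle is the same: requests under the threshold count fully, "tolerating" requests (under 4x the threshold) count half, and anything slower counts as zero.

```python
# Sketch of the standard Apdex calculation (hypothetical request latencies).
SATISFACTORY_T = 1.0               # seconds; the 1s limit discussed above
TOLERATING_T = 4 * SATISFACTORY_T  # conventional Apdex "tolerating" bound

def apdex(latencies):
    """Return the Apdex score (0.0-1.0) for a list of request durations."""
    satisfied = sum(1 for t in latencies if t <= SATISFACTORY_T)
    tolerating = sum(1 for t in latencies if SATISFACTORY_T < t <= TOLERATING_T)
    return (satisfied + tolerating / 2) / len(latencies)

# A few endpoints creeping past 1s drag the score down quickly:
print(apdex([0.2, 0.4, 1.3, 0.8, 0.5]))  # one tolerating request in five -> 0.9
```

Note how a single endpoint slipping just over the 1s boundary immediately costs half its weight, which is why broad, gradual slowdowns erode the aggregate score even when nothing is catastrophically slow.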
Over the last 7 days, these Rails controllers have been the worst offenders:
## Lowering the SLO
In the interest of reducing the strain on the on-call operators, it makes sense to lower this threshold for now. If and when these performance issues are addressed, we may be able to tighten the web/apdex SLO once again.
## Performance of the New Threshold for Historical Incidents
The lower threshold will still accurately detect issues such as this incident on the 16th of January:
## Trend
Plotting the web/apdex score over a 3-month period shows how the score is currently degrading, and illustrates that we have now crossed the 99.9% SLO threshold, which explains the constant alerts.
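The effect of the change can be sketched as a simple threshold comparison. The score value below is hypothetical; the point is that a score sitting just under 99.9% pages continuously under the old SLO but not under the new one, which is exactly the band producing the slow-burn alerts listed above.

```python
# Hypothetical check of a measured apdex score against the old and new SLOs,
# mirroring the basic alerting condition: alert when the score is below SLO.
OLD_SLO = 0.999
NEW_SLO = 0.998

def violates(apdex_score, slo):
    """An apdex alert fires when the measured score drops below the SLO."""
    return apdex_score < slo

score = 0.9985  # illustrative: between the two thresholds
print(violates(score, OLD_SLO), violates(score, NEW_SLO))  # True False
```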