fix(alerting): lower web apdex SLO from 99.9% to 99.8%
We've gotten used to tightening our SLOs as the availability and performance of GitLab.com have improved.
Unfortunately, this time, the trend is going the opposite way, and we will be loosening the SLO for web performance.
This is because we are seeing a great deal of slow-burn web PagerDuty alerts, such as the following:
- gitlab-com/gl-infra/production#3497 (closed)
- gitlab-com/gl-infra/production#3481 (closed)
- gitlab-com/gl-infra/production#3480 (closed)
- gitlab-com/gl-infra/production#3323 (closed)
- gitlab-com/gl-infra/production#3302 (closed)
- gitlab-com/gl-infra/production#3270 (closed)
- gitlab-com/gl-infra/production#3149
- gitlab-com/gl-infra/production#3092 (closed)
- gitlab-com/gl-infra/production#3053 (closed)
- gitlab-com/gl-infra/production#3051
- gitlab-com/gl-infra/production#2934
- gitlab-com/gl-infra/production#2931
In many cases, there is little we can do. Many endpoints in the application are slowing down and no longer rendering within the 1s limit required for the "satisfactory" apdex threshold.
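To make the mechanics concrete, here is a minimal sketch of the classic Apdex formula with the 1s satisfactory threshold mentioned above. The latencies and the exact weighting are illustrative; GitLab.com's production apdex is computed from latency histogram metrics, but the principle is the same: requests under the threshold count fully, "tolerating" requests (under 4x the threshold) count half, and anything slower counts as zero.

```python
# Sketch of the standard Apdex calculation (hypothetical request latencies).
SATISFACTORY_T = 1.0               # seconds; the 1s limit discussed above
TOLERATING_T = 4 * SATISFACTORY_T  # conventional Apdex "tolerating" bound

def apdex(latencies):
    """Return the Apdex score (0.0-1.0) for a list of request durations."""
    satisfied = sum(1 for t in latencies if t <= SATISFACTORY_T)
    tolerating = sum(1 for t in latencies if SATISFACTORY_T < t <= TOLERATING_T)
    return (satisfied + tolerating / 2) / len(latencies)

# A few endpoints creeping past 1s drag the score down quickly:
print(apdex([0.2, 0.4, 1.3, 0.8, 0.5]))  # one tolerating request in five -> 0.9
```

Note how a single endpoint slipping just over the 1s boundary immediately costs half its weight, which is why broad, gradual slowdowns erode the aggregate score even when nothing is catastrophically slow.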
Over the last 7 days, these Rails controllers have been the worst offenders:
## Lowering the SLO
In the interest of reducing the strain on the on-call operators, it makes sense to lower this threshold for now. If and when these performance issues are addressed, we may be able to tighten the web/apdex SLO once again.
## Performance of the New Threshold for Historical Incidents
The lower threshold will still accurately detect issues such as this incident on the 16th of January:
## Trend
Plotting the web/apdex score over a 3-month period shows how the score is currently degrading, and illustrates that we have now crossed the 99.9% SLO threshold, which explains the constant alerts.
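The effect of the change can be sketched as a simple threshold comparison. The score value below is hypothetical; the point is that a score sitting just under 99.9% pages continuously under the old SLO but not under the new one, which is exactly the band producing the slow-burn alerts listed above.

```python
# Hypothetical check of a measured apdex score against the old and new SLOs,
# mirroring the basic alerting condition: alert when the score is below SLO.
OLD_SLO = 0.999
NEW_SLO = 0.998

def violates(apdex_score, slo):
    """An apdex alert fires when the measured score drops below the SLO."""
    return apdex_score < slo

score = 0.9985  # illustrative: between the two thresholds
print(violates(score, OLD_SLO), violates(score, NEW_SLO))  # True False
```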