# SLO alerting for Sidekiq workers
Related to gitlab-com/gl-infra/scalability#175 (closed)
## SLO Metrics

This change adds SLOs for three metrics:
- Queue time
- Execution time
- Execution error rate
As a starting point, we will go with a 99% SLO over a one-month period, which corresponds to an error budget of roughly 7.2 hours per 30 days. Ideally we would like to add more nines over time, but let's start here.
The alerting uses the two standard burn-rate window pairs (four windows in total): 1h/5m and 6h/30m.
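As a rough sketch, a 1h/5m pair for the execution error rate could look like the rule below. The metric names (`sidekiq_jobs_failed_total`, `sidekiq_jobs_total`) and the 14.4× burn-rate factor (the conventional fast-burn multiplier for a 30-day, 99% SLO) are illustrative assumptions, not the actual rule definitions:

```yaml
groups:
  - name: sidekiq-slo-burn
    rules:
      - alert: SidekiqExecutionErrorSLOBurn
        # Fire only when both the long (1h) and short (5m) windows exceed
        # 14.4x the 1% error budget, i.e. an error ratio above 0.144.
        expr: >
          (
            sum by (feature_category) (rate(sidekiq_jobs_failed_total[1h]))
            /
            sum by (feature_category) (rate(sidekiq_jobs_total[1h]))
          ) > (14.4 * 0.01)
          and
          (
            sum by (feature_category) (rate(sidekiq_jobs_failed_total[5m]))
            /
            sum by (feature_category) (rate(sidekiq_jobs_total[5m]))
          ) > (14.4 * 0.01)
        labels:
          severity: warning
        annotations:
          summary: "Sidekiq execution error rate is burning the SLO budget fast (1h/5m)"
```

The 6h/30m pair would follow the same shape with a lower burn-rate factor, catching slower, sustained burns.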
## Minimum RPS
Unfortunately, at present, many workers have terrible metrics. Most of these workers are low-rate jobs that don't have enough activity to alert on reliably.

For this iteration, we'll start off by ignoring any job that is called fewer than 4 times per minute, on average, over a 6-hour period.

In future iterations, we may be able to group these low-rate jobs into bundles with enough activity. If we do this, grouping by `feature_category` would be a good option, so that the relevant team can still be notified of SLO breaches.
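In PromQL terms, the threshold could be expressed as a clause ANDed into each alert expression; 4 jobs/minute is 4/60 ≈ 0.067 jobs/second. The metric name here is an assumption for illustration:

```yaml
# Hypothetical minimum-rate guard: only consider workers that averaged
# at least 4 jobs per minute over the last 6 hours.
expr: >
  sum by (worker) (rate(sidekiq_jobs_total[6h])) >= (4 / 60)
```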
## Alerts

At present, the alerts will only be sent to #alerts-general in Slack.

Note that the alerts will contain a `feature_category` label, making it possible to route the alerts to the relevant teams.

Once we feel that the alerts are working well, we can route them to PagerDuty.
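A minimal Alertmanager routing sketch for this stage might look like the following; the receiver name is illustrative, and the Slack webhook wiring is omitted for brevity:

```yaml
route:
  # Everything goes to #alerts-general for now. Later, per-team routes
  # matching on the feature_category label can be added here.
  receiver: slack-alerts-general
  routes: []

receivers:
  - name: slack-alerts-general
    slack_configs:
      - channel: '#alerts-general'
```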
## Persistent Offenders
There are several jobs that are clearly going to alert a lot:
- `pages_domain_ssl_renewal` (feature category: `pages`, cc @jhampton) has consistent error rates of up to 40%: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=pages_domain_ssl_renewal
- `emails_on_push` (feature category: `source_code_management`, cc @m_gill) consistently takes longer than the 10s allocated to urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=emails_on_push
- `reactive_caching` (feature category: `not_owned`, cc me) consistently takes longer than the 10s allocated to urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=reactive_caching
- `deployment:deployments_forward_deployment` (feature category: `continuous_delivery`, cc @csouthard) consistently takes longer than 10m to run, exceeding the 5m allocated for non-urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=deployment:deployments_forward_deployment
- `project_export` (feature category: `importers`, cc @lmcandrew) frequently queues for over 10m, exceeding the 1 minute allocated for low-urgency jobs. Perhaps we should change `project_export` to `urgency: none`? Proposal added: gitlab-com/gl-infra/scalability#217 (closed). https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=project_export
- `authorized_projects` (feature category: `authentication_and_authorization`). This worker is a bit of a mess right now: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=authorized_projects
I propose that, while we tackle these problems, we apply silences for them in Alertmanager.