SLO alerting for Sidekiq workers
Related to gitlab-com/gl-infra/scalability#175
SLO Metrics
This change adds SLOs for three metrics:
- Queue time
- Execution time
- Execution error rate
As a starting point, we will go with a 99% SLO over a one-month period. Ideally we would like to add more nines over time, but let's start here.
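As a rough illustration (not the actual implementation), the queue-time and error-rate SLIs could be expressed as Prometheus recording rules along these lines. The metric names (`sidekiq_jobs_queue_duration_seconds`, `sidekiq_jobs_failed_total`, `sidekiq_jobs_completion_seconds`), the 10s bucket, and the rule names are assumptions and may not match what we actually ship:

```yaml
groups:
  - name: sidekiq-slis
    rules:
      # Queue-time SLI: share of jobs that started executing within an
      # assumed 10s queueing target bucket.
      - record: sidekiq:queue_time:success_rate_1h
        expr: >
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_queue_duration_seconds_bucket{le="10"}[1h])
          )
          /
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_queue_duration_seconds_count[1h])
          )
      # Execution error-rate SLI: failed jobs as a share of all jobs.
      - record: sidekiq:execution:error_rate_1h
        expr: >
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_failed_total[1h])
          )
          /
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_completion_seconds_count[1h])
          )
```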
The alerting uses the standard multiwindow, multi-burn-rate approach with the two usual window pairs: 1h/5m and 6h/30m.
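A sketch of what one of these alerts could look like, assuming the recording rules above plus equivalent `5m`, `30m`, and `6h` variants exist (all names and labels here are assumptions). For a 99% SLO the error budget is 1%, and 14.4x / 6x are the conventional burn-rate factors for these window pairs:

```yaml
groups:
  - name: sidekiq-slo-alerts
    rules:
      - alert: SidekiqWorkerErrorSLOViolation
        # 99% SLO => 1% error budget. 14.4x and 6x are the conventional
        # burn-rate factors for the 1h/5m and 6h/30m window pairs.
        expr: >
          (
            sidekiq:execution:error_rate_1h > (14.4 * 0.01)
            and
            sidekiq:execution:error_rate_5m > (14.4 * 0.01)
          )
          or
          (
            sidekiq:execution:error_rate_6h > (6 * 0.01)
            and
            sidekiq:execution:error_rate_30m > (6 * 0.01)
          )
        labels:
          # Illustrative only; the real severity/routing labels may differ.
          severity: s4
        annotations:
          title: 'Sidekiq worker {{ $labels.worker }} is burning through its error budget'
```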
Minimum RPS
Unfortunately, at present, many workers have terrible metrics. Most of these workers are low-rate jobs that don't have enough activity to alert on reliably.
For this iteration, we'll start by ignoring any job that, over a 6-hour period, is called on average less than 4 times a minute.
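One way to express that filter, assuming the same hypothetical metric and rule names as above:

```yaml
groups:
  - name: sidekiq-slo-minimum-rate
    rules:
      # 6-hour average job rate per worker.
      - record: sidekiq:execution:rate_6h
        expr: >
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_completion_seconds_count[6h])
          )
```

Each SLO alert expression would then gain a clause along the lines of `and on (worker, feature_category) sidekiq:execution:rate_6h > (4 / 60)`, since 4 jobs per minute is 4/60 jobs per second.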
In future iterations, we may be able to group these low-rate jobs into bundles with enough activity. If we do, grouping by feature_category would be a good option, so that the relevant team can still be notified of SLO breaches.
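A hypothetical per-category aggregate could look like this (again, metric and rule names are assumptions, not the agreed design):

```yaml
groups:
  - name: sidekiq-slo-by-category
    rules:
      # Low-rate workers contribute to a per-category aggregate that has
      # enough traffic to alert on reliably.
      - record: sidekiq:execution:error_rate_by_category_1h
        expr: >
          sum by (feature_category) (
            rate(sidekiq_jobs_failed_total[1h])
          )
          /
          sum by (feature_category) (
            rate(sidekiq_jobs_completion_seconds_count[1h])
          )
```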
Alerts
At present, the alerts will only be sent to #alerts-general in Slack.
Note that the alerts will contain a feature_category label, making it possible to route the alerts to the relevant teams.
Once we feel that the alerts are working well, we can route them to PagerDuty.
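For illustration, the Alertmanager routing could look roughly like this, assuming a global `slack_api_url` is already configured; the receiver names and the commented-out per-category route are hypothetical:

```yaml
route:
  receiver: slack-alerts-general
  group_by: ['alertname', 'worker', 'feature_category']
  # Later: fan out per feature_category, e.g. to a team channel or PagerDuty.
  # routes:
  #   - match:
  #       feature_category: pages
  #     receiver: pagerduty-pages
receivers:
  - name: slack-alerts-general
    slack_configs:
      - channel: '#alerts-general'
        send_resolved: true
```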
Persistent Offenders
There are several jobs that are clearly going to alert a lot:
- `pages_domain_ssl_renewal`, feature category: `pages` (cc @jhampton), has consistent error rates up to 40%: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=pages_domain_ssl_renewal
- `emails_on_push`, feature category: `source_code_management` (cc @m_gill), consistently takes longer than the 10s allocated to urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=emails_on_push
- `reactive_caching`, feature category: `not_owned` (cc me), consistently takes longer than the 10s allocated to urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=reactive_caching
- `deployment:deployments_forward_deployment`, feature category: `continuous_delivery` (cc @csouthard), consistently takes longer than 10m to run, exceeding the 5m allocated for non-urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=deployment:deployments_forward_deployment
- `project_export`, feature category: `importers` (cc @lmcandrew), frequently queues for over 10m, exceeding the 1 minute allocated for low-urgency jobs. Perhaps we should change `project_export` to `urgency: none`? https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=project_export Proposal added: gitlab-com/gl-infra/scalability#217
- `authorized_projects`, feature category: `authentication_and_authorization` (cc me). This worker is a bit of a mess right now: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=authorized_projects
I propose that, while we tackle these problems, we apply silences for them in Alertmanager.