SLO alerting for Sidekiq workers
Related to gitlab-com/gl-infra/scalability#175
SLO Metrics
This change adds SLOs for three metrics:
- Queue time
- Execution time
- Execution error rate
As a starting point, we will go with a 99% SLO over a one-month period. Ideally we would like to add more nines over time, but let's start here.
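As a rough illustration (not the actual implementation), the queue-time and error-rate SLIs could be expressed as Prometheus recording rules along these lines. The metric names (`sidekiq_jobs_queue_duration_seconds`, `sidekiq_jobs_failed_total`, `sidekiq_jobs_completion_seconds`), the 10s bucket, and the rule names are assumptions and may not match what we actually ship:

```yaml
groups:
  - name: sidekiq-slis
    rules:
      # Queue-time SLI: share of jobs that started executing within an
      # assumed 10s queueing target bucket.
      - record: sidekiq:queue_time:success_rate_1h
        expr: >
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_queue_duration_seconds_bucket{le="10"}[1h])
          )
          /
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_queue_duration_seconds_count[1h])
          )
      # Execution error-rate SLI: failed jobs as a share of all jobs.
      - record: sidekiq:execution:error_rate_1h
        expr: >
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_failed_total[1h])
          )
          /
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_completion_seconds_count[1h])
          )
```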
The alerting uses the standard multiwindow, multi-burn-rate approach with the two usual window pairs: 1h/5m and 6h/30m.
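A sketch of what one of these alerts could look like, assuming the recording rules above plus equivalent `5m`, `30m`, and `6h` variants exist (all names and labels here are assumptions). For a 99% SLO the error budget is 1%, and 14.4x / 6x are the conventional burn-rate factors for these window pairs:

```yaml
groups:
  - name: sidekiq-slo-alerts
    rules:
      - alert: SidekiqWorkerErrorSLOViolation
        # 99% SLO => 1% error budget. 14.4x and 6x are the conventional
        # burn-rate factors for the 1h/5m and 6h/30m window pairs.
        expr: >
          (
            sidekiq:execution:error_rate_1h > (14.4 * 0.01)
            and
            sidekiq:execution:error_rate_5m > (14.4 * 0.01)
          )
          or
          (
            sidekiq:execution:error_rate_6h > (6 * 0.01)
            and
            sidekiq:execution:error_rate_30m > (6 * 0.01)
          )
        labels:
          # Illustrative only; the real severity/routing labels may differ.
          severity: s4
        annotations:
          title: 'Sidekiq worker {{ $labels.worker }} is burning through its error budget'
```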
Minimum RPS
Unfortunately, at present, many workers have terrible metrics. Most of these workers are low-rate jobs that don't have enough activity to alert on reliably.
For this iteration, we'll start by ignoring any job that, over a 6-hour period, is called on average less than 4 times a minute.
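One way to express that filter, assuming the same hypothetical metric and rule names as above:

```yaml
groups:
  - name: sidekiq-slo-minimum-rate
    rules:
      # 6-hour average job rate per worker.
      - record: sidekiq:execution:rate_6h
        expr: >
          sum by (worker, feature_category) (
            rate(sidekiq_jobs_completion_seconds_count[6h])
          )
```

Each SLO alert expression would then gain a clause along the lines of `and on (worker, feature_category) sidekiq:execution:rate_6h > (4 / 60)`, since 4 jobs per minute is 4/60 jobs per second.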
In future iterations, we may be able to group these low-rate jobs into bundles with enough activity. If we do, grouping by feature_category would be a good option, so that the relevant team can still be notified of SLO breaches.
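A hypothetical per-category aggregate could look like this (again, metric and rule names are assumptions, not the agreed design):

```yaml
groups:
  - name: sidekiq-slo-by-category
    rules:
      # Low-rate workers contribute to a per-category aggregate that has
      # enough traffic to alert on reliably.
      - record: sidekiq:execution:error_rate_by_category_1h
        expr: >
          sum by (feature_category) (
            rate(sidekiq_jobs_failed_total[1h])
          )
          /
          sum by (feature_category) (
            rate(sidekiq_jobs_completion_seconds_count[1h])
          )
```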
Alerts
At present, the alerts will only be sent to #alerts-general in Slack.
Note that the alerts will contain a feature_category label, making it possible to route the alerts to the relevant teams.
Once we feel that the alerts are working well, we can route them to PagerDuty.
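For illustration, the Alertmanager routing could look roughly like this, assuming a global `slack_api_url` is already configured; the receiver names and the commented-out per-category route are hypothetical:

```yaml
route:
  receiver: slack-alerts-general
  group_by: ['alertname', 'worker', 'feature_category']
  # Later: fan out per feature_category, e.g. to a team channel or PagerDuty.
  # routes:
  #   - match:
  #       feature_category: pages
  #     receiver: pagerduty-pages
receivers:
  - name: slack-alerts-general
    slack_configs:
      - channel: '#alerts-general'
        send_resolved: true
```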
Persistent Offenders
There are several jobs that are clearly going to alert a lot:
- `pages_domain_ssl_renewal`, feature category: `pages` (cc @jhampton), has consistent error rates up to 40%: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=pages_domain_ssl_renewal
- `emails_on_push`, feature category: `source_code_management` (cc @m_gill), consistently takes longer than the 10s allocated to urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=emails_on_push
- `reactive_caching`, feature category: `not_owned` (cc me), consistently takes longer than the 10s allocated to urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=reactive_caching
- `deployment:deployments_forward_deployment`, feature category: `continuous_delivery` (cc @csouthard), consistently takes longer than 10m to run, exceeding the 5m allocated for non-urgent jobs: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=deployment:deployments_forward_deployment
- `project_export`, feature category: `importers` (cc @lmcandrew), frequently queues for over 10m, exceeding the 1 minute allocated for low-urgency jobs. Perhaps we should change `project_export` to `urgency: none`? https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=project_export Proposal added: gitlab-com/gl-infra/scalability#217
- `authorized_projects`, feature category: `authentication_and_authorization` (cc me). This worker is a bit of a mess right now: https://dashboards.gitlab.net/d/sidekiq-queue-detail?var-queue=authorized_projects
I propose that, while we tackle these problems, we apply silences for them in Alertmanager.