SLO alerting for Sidekiq workers

Andrew Newdigate requested to merge slo-alerting-for-sidekiq-workers into master

Related to gitlab-com/gl-infra/scalability#175 (closed)

SLO Metrics

This change adds SLOs for three metrics:

  1. Queue time
  2. Execution time
  3. Execution error rate

As a starting point, we will go with a 99% SLO over a one-month period. Ideally, over time, we would like to add more nines, but let's start here.

The alerting uses the two standard multi-burn-rate window pairs, four rates in total: 1h/5m and 6h/30m.
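As a rough illustration only, the sketch below shows what the 1h/5m pair could look like for the execution error rate SLO, using the conventional 14.4x burn-rate factor against the 1% error budget that a 99% target leaves. The metric names (`sidekiq_jobs_failed_total`, `sidekiq_jobs_completion_count`), labels, and severity are assumptions, not the actual rules in this MR; the 6h/30m pair would look the same apart from the windows and a 6x factor.

```yaml
# Hedged sketch: metric names and labels are assumed, not taken from this MR.
groups:
  - name: sidekiq-execution-error-rate-slo
    rules:
      - alert: SidekiqExecutionErrorRateSLOViolation
        # Fire when the error ratio burns the 1% error budget at >= 14.4x
        # over both the long (1h) and short (5m) windows.
        expr: |
          (
              sum(rate(sidekiq_jobs_failed_total[1h])) by (queue, feature_category)
            /
              sum(rate(sidekiq_jobs_completion_count[1h])) by (queue, feature_category)
          ) > (14.4 * 0.01)
          and
          (
              sum(rate(sidekiq_jobs_failed_total[5m])) by (queue, feature_category)
            /
              sum(rate(sidekiq_jobs_completion_count[5m])) by (queue, feature_category)
          ) > (14.4 * 0.01)
        labels:
          severity: warning
        annotations:
          description: >
            Sidekiq execution error rate for {{ $labels.queue }} is burning
            through its 99% SLO error budget.
```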

Minimum RPS

Unfortunately, at present, many workers have very noisy SLO metrics. Most of these are low-rate jobs that don't have enough activity to alert on reliably.

For this iteration, we'll start off by ignoring any job that, over a 6-hour period, is called on average fewer than 4 times a minute.
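For illustration, the minimum-rate condition could be captured in a recording rule along these lines and then and-ed onto the alert expressions above; the metric name and the record name are assumptions.

```yaml
# Hedged sketch: record the 6h average job rate per worker so the SLO alerts
# can ignore anything averaging fewer than 4 jobs/minute (4 / 60 jobs/second).
groups:
  - name: sidekiq-minimum-rps
    rules:
      - record: sidekiq:jobs:rate_6h
        expr: >
          sum(rate(sidekiq_jobs_completion_count[6h])) by (queue, feature_category)
```

Each alert expression would then include something like `and on (queue, feature_category) sidekiq:jobs:rate_6h >= (4 / 60)`.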

In future iterations, we may be able to group these low-rate jobs into bundles with enough activity to alert on. If we do this, grouping by feature_category would be a good option, so that the relevant team can still be notified of SLO breaches.

Alerts

At present, the alerts will only be sent to #alerts-general in Slack.

Note that the alerts will contain a feature_category label, making it possible to route the alerts to the relevant teams.

Once we feel that the alerts are working well, we can route them to PagerDuty.
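To make the routing concrete, here is a rough sketch of how the Alertmanager route could look for now, and where per-feature_category routes could later hook in; the receiver name, matcher label, and Slack config are assumptions, not taken from this MR.

```yaml
# Hedged sketch of the routing described above; receiver names and the
# matcher label are assumptions.
route:
  receiver: slack-alerts-general
  routes:
    - match:
        type: sidekiq-slo
      receiver: slack-alerts-general
      # Later iteration: add child routes matching on feature_category here,
      # sending each team's alerts to their own channel or PagerDuty service.

receivers:
  - name: slack-alerts-general
    slack_configs:
      - channel: '#alerts-general'
```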

Persistent Offenders

There are several jobs that are clearly going to alert a lot:

I propose that while we tackle these problems, we apply silences for them in AlertManager.
