
Create Sidekiq per-worker alerts in Mimir

In Thanos, these alerts come from thanos-rules-jsonnet/sidekiq-queue-rules.jsonnet.

For Thanos, that file uses an extra aggregation set that is not sourced from SLIs: sidekiqWorkerQueueSLIs. This aggregation set is sourced from sidekiq_enqueued_jobs_total. Because these recordings don't go through SLIs, they are not part of our recording-rule-registry optimization. But the source metric is emitted by the entire Rails fleet (Sidekiq + web, api, git, ...), which means the recordings have a very high cardinality.

To work around this, I think we should switch the alerts that currently use the gitlab_background_jobs:queue:ops:rate_* recordings to new metrics from the SLI registry, in both Mimir and Thanos. This would involve the following steps:

  1. Extend the sidekiq_queuing SLI to have an ops rate using sidekiq_enqueued_jobs_total. Alternatively, add a separate SLI with just an operation rate for this. Some complications might occur when adding it to the existing SLI (even though that would be a nice place to add it, and to later extend it with an error rate):
    • emittedBy might not match the other metrics in this SLI, because those use server-side metrics while these are client-side.
    • shard might not match: because these are server-side metrics, the advertised shard label might not match the label for the queue/worker that is being enqueued.
  2. Use sli_aggregations: for all of the alerts from thanos-rules-jsonnet/sidekiq-queue-rules.jsonnet that currently use the gitlab_background_jobs:queue:ops aggregation, in both Thanos and Mimir. This allows us to remove sidekiq-per-worker-recording-rules.libsonnet for both Thanos and Mimir.
  3. Optionally, and outside of this project: add an error rate to sidekiq_queueing, calculated as ops_rate - success_rate, for monitoring jobs that fail to be dequeued. (This still needs a new issue.)
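
To make steps 1 and 3 a bit more concrete, here is a minimal Jsonnet sketch of the expressions involved. It does not use the real SLI library: the rateOf helper, the env and worker grouping labels, the 5m window, and the example_dequeued_jobs_total metric are all assumptions made for illustration. Only sidekiq_enqueued_jobs_total comes from the description above.

```jsonnet
// Hypothetical helper: builds a PromQL rate expression as a string.
// The `env` and `worker` grouping labels and the 5m window are assumptions.
local rateOf(metric, window='5m') =
  'sum by (env, worker) (rate(%s[%s]))' % [metric, window];

// Step 1: an operation rate from the client-side enqueue counter.
local opsRate = rateOf('sidekiq_enqueued_jobs_total');

// Step 3: `example_dequeued_jobs_total` is a placeholder name, not a real metric.
local successRate = rateOf('example_dequeued_jobs_total');

{
  opsRate: opsRate,
  // Error rate as described in step 3: ops_rate - success_rate.
  errorRate: '(%s) - (%s)' % [opsRate, successRate],
}
```

Evaluating this yields an object with two PromQL strings. In a real SLI definition, the emittedBy and shard concerns from step 1 would still need to be resolved before the two sides of the subtraction line up.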

To migrate these alerts, we'll need to extract sidekiq-queue-rules.jsonnet out into a library that accepts a recording-rule-registry object that is different for Mimir than it is for Thanos.
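
A rough sketch of what such a library could look like, assuming a registry is simply an object that knows how to resolve an ops-rate recording name. The function name sidekiqQueueAlerts, the opsRateExpr method, the recording names, and the example alert are all hypothetical; the real recording-rule-registry interface may look quite different.

```jsonnet
// Hypothetical library function: builds alert rules from whichever registry it is given.
local sidekiqQueueAlerts(registry) = [
  {
    alert: 'ExampleSidekiqQueueAlert',  // placeholder alert name
    expr: '%s == 0' % [registry.opsRateExpr('5m')],
    'for': '10m',  // assumed duration
  },
];

// Hypothetical registries: one per backend, each resolving to its own recording name.
local thanosRegistry = {
  opsRateExpr(window):: 'example_thanos_recording:ops:rate_%s' % [window],
};
local mimirRegistry = {
  opsRateExpr(window):: 'example_mimir_recording:ops:rate_%s' % [window],
};

{
  thanos: sidekiqQueueAlerts(thanosRegistry),
  mimir: sidekiqQueueAlerts(mimirRegistry),
}
```

The point of this shape is that the alert definitions are written once, and only the registry passed in differs between Thanos and Mimir.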

All alerts defined here can use the recording rule registry, except for the SidekiqJobsSkippedTooLong alert, which uses sidekiq_jobs_skipped_total. For this alert, we can continue using metrics straight from source. During normal operation, the application does not emit this metric; it only appears during incidents, when we enable one of the feature flags for skipping jobs (example). So cardinality is less of a problem here.
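
For illustration, a source-backed rule for this alert might look roughly like the following. The alert name and metric name come from the paragraph above; the expression, grouping labels, window, and duration are assumptions and not the current alert definition.

```jsonnet
{
  alert: 'SidekiqJobsSkippedTooLong',
  // Queries the raw counter directly instead of going through the registry.
  // Assumed expression: the labels, window, and threshold are placeholders.
  expr: 'sum by (env, worker) (rate(sidekiq_jobs_skipped_total[1h])) > 0',
  'for': '3h',  // assumed duration
}
```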

Edited by Bob Van Landuyt