Create Sidekiq per-worker alerts in Mimir
These alerts in Thanos come from thanos-rules-jsonnet/sidekiq-queue-rules.jsonnet.
In it, for Thanos, we use an extra aggregation set that is not sourced from SLIs: sidekiqWorkerQueueSLIs. This aggregation set is sourced from sidekiq_enqueued_jobs_total. Because these recordings don't go through SLIs, they are not part of our recording-rule-registry optimization. But they are emitted by the entire Rails fleet (Sidekiq, web, api, git, ...), which means they have very high cardinality.
To work around this, I think we should switch the alerts that currently use the gitlab_background_jobs:queue:ops:rate_* recordings to new metrics from the SLI registry, in both Mimir and Thanos. This would involve the following steps:
- Extend the sidekiq_queueing SLI to have an ops rate using sidekiq_enqueued_jobs_total. Or, alternatively, add a separate SLI with just an operation rate for this. Some complications that might occur when adding it to the existing SLI (even though that would be a nice place to add it and extend it to an error rate):
  - emittedBy might not match the other metrics in this SLI, because those use server-side metrics while these are client-side.
  - shard might not match: because these are client-side metrics, the advertised shard label might not match the label for the queue/worker that is being enqueued.
- Use sli_aggregations: recordings for all of the alerts from thanos-rules-jsonnet/sidekiq-queue-rules.jsonnet that currently use the gitlab_background_jobs:queue:ops aggregation, in both Thanos & Mimir. This allows us to remove sidekiq-per-worker-recording-rules.libsonnet for both Thanos & Mimir.
- Optionally, outside of this project: add an error rate to sidekiq_queueing that uses ops_rate - success_rate as an error rate, for monitoring jobs that fail to be dequeued? (This still needs a new issue.)
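As a rough illustration of the alternative in the first step (a separate SLI that carries only an operation rate), here is a sketch in plain Jsonnet. It does not use the real metrics-catalog helpers: the rateMetric function, the field names, the SLI name sidekiq_enqueueing, the recording name, and the aggregation labels are all assumptions made for illustration. Only sidekiq_enqueued_jobs_total comes from the text above.

```jsonnet
// Hypothetical stand-in for a catalog helper: describes a rate over one counter.
local rateMetric(counter, selector='') = {
  counter: counter,
  selector: selector,
  // Renders an aggregated PromQL rate expression for the given labels and window.
  aggregatedRateQuery(labels, window)::
    'sum by (%s) (rate(%s{%s}[%s]))' % [std.join(',', labels), counter, selector, window],
};

// Hypothetical ops-rate-only SLI: no apdex and no error rate.
local enqueueSLI = {
  name: 'sidekiq_enqueueing',  // placeholder name, not an existing SLI
  requestRate: rateMetric('sidekiq_enqueued_jobs_total'),
};

// The recording keeps only the labels listed here (assumed to exist on the
// counter); every other label on the source series is aggregated away.
{
  record: 'placeholder:%s:ops:rate_5m' % enqueueSLI.name,
  expr: enqueueSLI.requestRate.aggregatedRateQuery(['queue', 'worker'], '5m'),
}
```

Keeping the ops rate in its own SLI would sidestep the emittedBy and shard mismatches listed above, at the cost of not being able to extend the existing SLI with an error rate directly.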
To migrate these alerts, we'll need to extract sidekiq-queue-rules.jsonnet into a library that accepts a recording-rule-registry object, which is different for Mimir than it is for Thanos.
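A minimal sketch of what that extraction could look like, assuming the registry object exposes a single method for resolving an ops-rate recording. The opsRateExpr method, the alert name, the threshold, the output file names, and both recording names are hypothetical placeholders, not names from the runbooks repository.

```jsonnet
// Hypothetical library function: builds the queue alerts from whichever
// registry it is handed, so it never hard-codes a recording name.
local sidekiqQueueRules(registry) = {
  groups: [{
    name: 'sidekiq-queue-alerts',  // placeholder group name
    rules: [{
      alert: 'ExampleSidekiqQueueAlert',  // placeholder alert name
      expr: '%s > 0' % registry.opsRateExpr('5m'),  // placeholder threshold
      'for': '20m',
    }],
  }],
};

// Hypothetical registries, one per backend. Each resolves the same logical
// aggregation to the recording that exists in that backend.
local thanosRegistry = {
  opsRateExpr(window):: 'placeholder_thanos_recording:ops:rate_%s' % window,
};
local mimirRegistry = {
  opsRateExpr(window):: 'placeholder_mimir_recording:ops:rate_%s' % window,
};

// The same library renders one rule file per backend.
{
  'thanos/sidekiq-queue-rules.yml': sidekiqQueueRules(thanosRegistry),
  'mimir/sidekiq-queue-rules.yml': sidekiqQueueRules(mimirRegistry),
}
```

Passing the registry in as an argument keeps the alert definitions identical across backends, so only the lookup of recording names differs.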
All alerts defined here can use the recording rule registry, except for the SidekiqJobsSkippedTooLong alert, which uses sidekiq_jobs_skipped_total. For this alert, we can continue using metrics straight from source. During normal operation, the application does not emit this metric; it only appears during incidents, when we enable one of the feature flags for skipping jobs (example). So cardinality is less of a problem here.
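For illustration, an alert that reads the raw counter directly might look like the sketch below. This is not the existing rule definition: the worker label, the rate window, the threshold, and the duration are assumptions chosen only to show the shape.

```jsonnet
// Illustrative only: label, window, threshold, and duration are assumptions.
{
  alert: 'SidekiqJobsSkippedTooLong',
  // Queries the source metric directly instead of going through the registry.
  expr: 'sum by (worker) (rate(sidekiq_jobs_skipped_total[1h])) > 0',
  'for': '3h',
}
```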