Alert on throttled jobs being enqueued that are not being dequeued
Rebased on !2045 (merged)
This MR adds an alert for `urgency: throttled` jobs that are not being dequeued.
Why is this important?
At present, there is no SLO on queue times on throttled jobs, since we cannot determine how long it will take for these jobs to run.
The downside of this is that, if these jobs were to stop being dequeued entirely, we would not receive any alert about the situation.
This change adds an alert that is not based on queue-time latency. Instead, it compares the rate at which jobs are being enqueued to the rate at which jobs of the same type are failing, running, or completing.
For any ten-minute period during which jobs are enqueued, we check a twenty-minute period to ensure that jobs of the same type are running. If they are not, the alert will fire.
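As a rough illustration, the check above could be expressed as a Prometheus rule along these lines. This is a hypothetical sketch only: the metric name `sidekiq_jobs_completion_seconds_count`, the `urgency` label matcher, and the alert name are assumptions, not taken from this MR.

```yaml
# Hypothetical sketch; metric/label names are assumptions.
- alert: ThrottledSidekiqJobsEnqueuedButNotDequeued
  expr: >
    sum by (worker) (
      increase(sidekiq_enqueued_jobs_total{urgency="throttled"}[10m])
    ) > 0
    unless
    sum by (worker) (
      increase(sidekiq_jobs_completion_seconds_count{urgency="throttled"}[20m])
    ) > 0
  for: 10m
```

The `unless` operator keeps only workers that saw enqueues in the last ten minutes but no completions in the last twenty, which matches the behaviour described above.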
Why do this only for throttled jobs?
Originally, this change was intended to work for all jobs, but unfortunately there are some shortcomings in our metrics. Primarily, the `sidekiq_enqueued_jobs_total` metric does not differentiate between immediately scheduled jobs and jobs scheduled for the future. For worker classes that are only occasionally fired, and always fired via `perform_in` or `perform_at` with a timestamp in the future, the alert can incorrectly fire, stating that the job has not been dequeued.
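To make the false positive concrete, here is a minimal, self-contained simulation (not the actual Sidekiq metrics code) of why the metric is ambiguous: both enqueue paths increment the same counter series, with no label distinguishing the scheduled case.

```ruby
# Hypothetical counter keyed by worker class, mirroring the shape of
# sidekiq_enqueued_jobs_total (simplified; the real metric has more labels).
enqueued_total = Hash.new(0)

# Both an immediate enqueue (perform_async) and a future-scheduled enqueue
# (perform_in / perform_at) increment the same series for the worker.
def track_enqueue(counter, worker)
  counter[worker] += 1
end

track_enqueue(enqueued_total, "ImmediateWorker") # perform_async: runs now
track_enqueue(enqueued_total, "FutureWorker")    # perform_in(1.hour): runs later

# From the alert's point of view the two workers are indistinguishable:
# one enqueue each. A worker whose jobs are all scheduled in the future
# therefore looks "enqueued but never dequeued" within the alert window.
puts enqueued_total
```

This is why, for now, the alert is scoped to throttled jobs, where this enqueue pattern is not a concern.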
We could work around this by adding a label to the `sidekiq_enqueued_jobs_total` metric, but for the moment this alert is needed primarily for throttled jobs, so we'll roll it out for those as a first step.