Add a metric for Sidekiq's ScheduledSet/RetrySet
In Sidekiq, jobs that are scheduled to run in the future (via perform_in or perform_at) are stored in the "scheduled set" (a ZSET in Redis). A poller thread in each Sidekiq process periodically grabs the earliest job scheduled at or before "now" and enqueues it onto the normal queue it belongs to.
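For orientation, the poll loop works roughly like this (a hedged sketch, not Sidekiq's actual code; see the scheduled.rb link below for the real implementation — the client object is duck-typed here, standing in for a redis-rb connection):

```ruby
require "json"

# Sketch of the scheduled-set poll loop. The "schedule" ZSET keys job
# payloads (JSON strings) by their scheduled run time (Unix timestamp).
def enqueue_due_jobs(redis, now = Time.now.to_f)
  # Repeatedly pop the earliest job whose score is <= now.
  while (job = redis.zrangebyscore("schedule", "-inf", now, limit: [0, 1]).first)
    # ZREM is the race arbiter: only the poller that removes the entry
    # gets to enqueue it.
    next unless redis.zrem("schedule", job)

    payload = JSON.parse(job)
    # Push onto the job's normal queue (a plain Redis list).
    redis.lpush("queue:#{payload["queue"]}", job)
  end
end
```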
We currently have observability of how many jobs are in that set via the sidekiq_jobs_scheduled_size metric, but that doesn't tell us how far behind we are. In recent Sidekiq incidents 1 and 2, part of the problem was dealing with a backlog in the scheduled set: the jobs being popped off and executed had expected to run an hour or more earlier, when normally Sidekiq keeps up to within a small number of seconds (usually single digits). There can be anywhere up to 200K entries in the scheduled set, which is fine as long as the expected run times are all in the future. But during the first incident we reached 1.8M entries and were over an hour behind; during the second we reached 800K, but we didn't look at the magnitude of the delay.
The delay can also be observed in the Sidekiq web UI, but that is just a point-in-time snapshot and does not allow alerting.
It should be trivial to add a probe to gitlab-exporter that looks at the job that will run next, either in the same way Sidekiq does (https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/scheduled.rb#L18) or with some other suitably quick Redis command, and report how far "behind" we are. For example, if the next job to run was scheduled for 09:00 but it's currently 09:30, our "backlog" is 30 minutes (1800 seconds), and that's the metric to export. I propose `now - scheduled_time` as the expression to expose; this will be negative or near 0 in the happy case (only tasks due right around now or further in the future), while significantly positive values indicate a problem. We may also want to clamp the value to 0 if negative; in that case the backlog is 0, and it doesn't matter how far in the future the next job is scheduled to run.
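The probe could look something like this (a hedged sketch; the method name is illustrative, not an existing gitlab-exporter API, and the client object is duck-typed, standing in for a redis-rb connection):

```ruby
# The head of the "schedule" ZSET is the next job to run; its score is
# the Unix timestamp it was scheduled for.
def scheduled_backlog_seconds(redis, now = Time.now.to_f)
  head = redis.zrange("schedule", 0, 0, with_scores: true).first
  return 0.0 if head.nil? # empty set: no backlog

  _payload, scheduled_time = head
  backlog = now - scheduled_time
  # Clamp negative values: a head job still in the future means zero backlog.
  backlog.positive? ? backlog : 0.0
end
```

ZRANGE with a 0..0 range is O(log N), so this stays cheap even when the set holds millions of entries.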
This would help us distinguish "200K in the queue and everything is fine" from "100K in the queue but we're 30 minutes behind". It's a crude metric, but it could be useful during incidents, or as an indicator (alert) that something is wrong.
Even better would be to know how many jobs we would have to process to reach a stable/static state (metric around 0), rather than just looking at the head; but that might be expensive to calculate, so it's a stretch goal, and only worth doing if there is some efficient way to achieve it.
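One cheap approximation of that stretch goal might be ZCOUNT, which counts the entries whose score is at or below "now", i.e. jobs already overdue (again a hedged sketch with an illustrative name and a duck-typed client):

```ruby
# Count scheduled jobs that are already overdue. ZCOUNT is O(log N),
# so this should be cheap even with millions of entries.
def overdue_scheduled_jobs(redis, now = Time.now.to_f)
  redis.zcount("schedule", "-inf", now)
end
```

Note this is only a lower bound on "jobs to process to reach stability", since more jobs become due while the backlog drains.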
The RetrySet is implemented the same way and could be exported in the same way, either as the same metric with a distinguishing label or as a distinct metric (I lean slightly toward the latter, but wouldn't fight hard for that opinion). We didn't observe any problems with that set during the incidents, but that's not to say it won't be a problem at other times.
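In Prometheus exposition terms, the two options would look something like this (metric names are hypothetical, purely to illustrate the shape of each choice):

```
# Option A: one metric, label distinguishes the set
sidekiq_set_backlog_seconds{set="scheduled"} 1800
sidekiq_set_backlog_seconds{set="retry"} 0

# Option B: distinct metrics
sidekiq_scheduled_backlog_seconds 1800
sidekiq_retry_backlog_seconds 0
```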