Track job waiter attempts and timeouts
Add 2 counters to WaitableWorker#bulk_perform_async:
- job_waiter_started_total: gets incremented when we call JobWaiter#wait
- job_waiter_timeouts_total: gets incremented if JobWaiter#wait returned false and the jobs haven't finished.
These counters need to have a label worker that contains the class name.
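A minimal sketch of the counting logic described above, not the actual GitLab implementation. LabeledCounter and StubJobWaiter are hypothetical in-memory stand-ins (for a Prometheus counter and for JobWaiter respectively), and wait_with_metrics is an illustrative helper name. It assumes, as stated above, that JobWaiter#wait returns false when the jobs did not finish in time.

```ruby
# Hypothetical in-memory stand-in for a Prometheus counter with labels.
class LabeledCounter
  attr_reader :name

  def initialize(name)
    @name = name
    @values = Hash.new(0)
  end

  # Increment the counter for the given label set.
  def increment(labels = {})
    @values[labels] += 1
  end

  # Read the current value for the given label set (0 if never incremented).
  def get(labels = {})
    @values[labels]
  end
end

# Hypothetical stub for JobWaiter: #wait returns true if all jobs finished
# within the timeout and false otherwise (assumption taken from the text).
class StubJobWaiter
  def initialize(finishes_in_time)
    @finishes_in_time = finishes_in_time
  end

  def wait(_timeout)
    @finishes_in_time
  end
end

STARTED  = LabeledCounter.new('job_waiter_started_total')
TIMEOUTS = LabeledCounter.new('job_waiter_timeouts_total')

# Illustrative helper: count every wait, and count a timeout whenever
# the waiter reports that the jobs did not finish.
def wait_with_metrics(waiter, worker_class, timeout: 10)
  labels = { worker: worker_class.to_s }
  STARTED.increment(labels)
  finished = waiter.wait(timeout)
  TIMEOUTS.increment(labels) unless finished
  finished
end

# One wait that completes and one that times out.
wait_with_metrics(StubJobWaiter.new(true), 'AuthorizedProjectsWorker')
wait_with_metrics(StubJobWaiter.new(false), 'AuthorizedProjectsWorker')
```

With both counters carrying the same worker label, the timeout rate per worker is simply timeouts divided by started waits.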
From #148 (closed) and #148 (comment 287636734).
The authorized_projects queue uses a JobWaiter class, so a web request will schedule some Sidekiq jobs, then wait up to 10 seconds for them to complete. In order to improve this job, we need to know how often we hit that timeout, by worker. Proposal:
maybe just two Prometheus counters make more sense? Something like job_waiters_started and job_waiters_timed_out. We'd only need one label, the job class (which is almost always AuthorizedProjectsWorker anyway).
This is blocking some later improvements to this queue, because we need this data to decide which way to go with this. If it times out a lot, maybe we can just remove the wait? (And perhaps the latency-sensitive attribute.) But that might be risky, so we'd need to see the data.
Note that the slowest jobs for this worker currently take on the order of 30 seconds, so they will always exceed the waiter's 10-second timeout.