rate() currently doesn't work well to measure request rate on low-frequency Sidekiq jobs, particularly with Kubernetes
While trying to discover why we didn't get alerted for Ci::ArchiveTracesCronWorker failing for an extended period (gitlab-org/gitlab#330141 (closed)), we noticed that the 'request/operation rate' for those jobs was pinned to 0 even though the job was clearly running (just failing when it did). After some debugging:
- The raw metric is `sidekiq_jobs_completion_seconds_bucket`, which we are trying to convert to a rate and then sum across the fleet.
- There is one of these metrics for each VM + pod.
- Pods don't live very long, particularly relative to the intervals that some of these jobs run (hourly, daily, etc).
- The metric doesn't exist until one job has run, at which time it springs into existence with the value of 1, never existing with value 0.
- If a given pod/process never runs that job again in the rest of its life (likely in many such low-frequency cases), then the metric eventually evaporates having never held any value other than '1'.
- When taking the rate of such a metric, there is no delta across time so the rate remains 0. We can sum that across the fleet all we like, but it will remain 0.
- `increase` and `delta` have the same problem (see the rough illustration below).
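To make the failure mode concrete, here is an illustration-only simplification (not Prometheus's actual extrapolation logic) of why `rate()`, `increase()`, and `delta()` all come out as 0 for such a series:

```ruby
# Illustration only: a crude stand-in for rate() over one low-frequency series.
# The series only exists once the first job has completed, so every sample
# Prometheus ever scrapes for it is 1; the jump from "no series" to 1 is never
# visible as an increase.
samples = [1, 1, 1, 1]   # values scraped during the query window
window_seconds = 300
approx_rate = (samples.last - samples.first).to_f / window_seconds
# => 0.0 -- and sum()ing zeroes across the fleet is still zero
```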
This is (probably) reasonable behavior for Prometheus and unlikely to change anytime soon (see https://github.com/prometheus/prometheus/issues/1673) but is unfortunate for us.
The most obvious solution is to initialize all counters to 0 at process startup (https://www.section.io/blog/beware-prometheus-counters-that-do-not-begin-at-zero/ mentions this approach); there's still a small possible edge case (the job starts before the first scrape), but we might be saved by Rails boot times (if the scrapes start returning data before the pod actually starts pulling work). The other downside is what this will do to our metrics cardinality.
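As a minimal sketch of that approach, using the standard `prometheus-client` gem with a made-up counter and worker list purely for illustration (this is not our actual Sidekiq metrics code), incrementing by zero at boot should be enough to make each series exist with value 0 before the first job runs:

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

# Hypothetical counter standing in for the real job-completion metric.
jobs_completed = Prometheus::Client::Counter.new(
  :example_jobs_completed_total,
  docstring: 'Jobs completed, by worker',
  labels: [:worker]
)
registry.register(jobs_completed)

# At process startup, touch every label set we expect to use. Incrementing by 0
# creates the series at 0, so rate()/increase() have a baseline to work from.
%w[Ci::ArchiveTracesCronWorker SomeOtherWorker].each do |worker|
  jobs_completed.increment(by: 0, labels: { worker: worker })
end
```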
We do not have the luxury of doing nothing; at the moment, we simply cannot inspect (or alert on) the actual operation rate of low-frequency jobs, which is a large blind spot. Other options include more centralized monitoring (counting keys in Redis that are exported once, rather than from every node/pod/process, or DeadMansSnitch-type mechanisms for ensuring jobs successfully ran 'recently'), but they require either more complicated development effort (gitlab-exporter?!), or are less generic/automatic.
So first up: what's the cardinality impact of the simple solution? Is that acceptable? Could we manage it if we split Prometheus further (one instance for just Sidekiq, perhaps?), or do we need to look at the more involved efforts?
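For scale, here is a back-of-the-envelope sketch of the extra series created by pre-initialisation; every number below is an illustrative placeholder to be replaced with real values from our environment:

```ruby
# All figures are illustrative placeholders, not measured values.
workers_per_process  = 400   # worker classes a Sidekiq process might run
series_per_histogram = 14    # e.g. 12 buckets + _sum + _count
label_combinations   = 1     # extra label dimensions per worker, if any
pods                 = 100   # concurrently-scraped Sidekiq pods/processes

extra_series = workers_per_process * series_per_histogram * label_combinations * pods
# => 560_000 potential series that now exist from boot instead of on first use
```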
/cc @andrewn
Proposal
Based on the discussion below, we want Sidekiq processes to pre-set `sidekiq_jobs_completion_seconds` to zero for all jobs that process expects to execute.
We can get that information in a way that's compatible with one-queue-per-worker and one-queue-per-shard by doing the following:
```ruby
# Pairs of [worker class name, queue name] for every worker this Sidekiq
# process is configured to pull jobs from.
::Gitlab::SidekiqConfig
  .workers
  .reject { |worker| worker.klass.is_a?(Gitlab::SidekiqConfig::DummyWorker) }
  .map { |worker| [worker.klass.to_s, ::Gitlab::SidekiqConfig::WorkerRouter.global.route(worker.klass)] }
  .select { |(worker, queue)| Sidekiq.options[:queues].include?(queue) }
```
(As #1133 (comment 605043721) notes, we have a similar method in a Rake task, so we might want to move that to a more convenient location and reuse it here.)
We don't want to pre-set any of the other Sidekiq metrics we export for now, just the jobs completion metric.
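Putting the pieces together, a rough sketch of what process startup might do; the `initialize_completion_metric` helper is hypothetical, standing in for however the metrics exporter ends up touching the histogram's label set without recording a fake observation:

```ruby
# Sketch only: enumerate the workers this process will run and pre-set the
# completion metric for each, so the series exists at 0 from boot.
expected = ::Gitlab::SidekiqConfig
  .workers
  .reject { |worker| worker.klass.is_a?(Gitlab::SidekiqConfig::DummyWorker) }
  .map { |worker| [worker.klass.to_s, ::Gitlab::SidekiqConfig::WorkerRouter.global.route(worker.klass)] }
  .select { |(_worker, queue)| Sidekiq.options[:queues].include?(queue) }

expected.each do |worker_name, queue|
  # Hypothetical helper: creates the sidekiq_jobs_completion_seconds series for
  # this worker/queue with zero observations, without recording a fake timing.
  initialize_completion_metric(worker: worker_name, queue: queue)
end
```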