Measure per-process sidekiq worker concurrency
Introduction
On GitLab.com, we have done a great deal of work simplifying and optimising the sidekiq configuration.
One of the effects of this change is that all the worker processes in a sidekiq cluster will process the same queues, and each queue will run in exactly one sidekiq cluster priority.
The new configuration is described in this file: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/blob/master/tools/sidekiq-config/sidekiq-queue-configurations.libsonnet
One of the advantages of this approach is that we can monitor each Sidekiq priority and determine how close to saturation the fleet is.
For example:
- If the besteffort fleet runs on 6 nodes
- Each node runs 8 sidekiq worker processes
- Each sidekiq worker process has a concurrency of 15, meaning that 15 sidekiq jobs can be processed in parallel.
This means that at any moment, up to 6 * 8 * 15 = 720 besteffort priority jobs can be running. If we hit this limit, jobs will start queueing up. If this happens for short periods, it's fine, but if it happens frequently, we probably need to scale the fleet up. Likewise, if we never reach the limit, we can probably scale the fleet down.
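The arithmetic above can be sketched as a small helper. The function and parameter names are illustrative only and are not part of any existing tooling; the numbers are the besteffort figures from the example.

```python
def fleet_capacity(nodes: int, processes_per_node: int, concurrency: int) -> int:
    """Upper bound on the number of jobs a fleet can run in parallel."""
    return nodes * processes_per_node * concurrency


# besteffort example from the text: 6 nodes, 8 processes per node, concurrency 15
besteffort_capacity = fleet_capacity(nodes=6, processes_per_node=8, concurrency=15)
print(besteffort_capacity)  # 720
```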
We can calculate how many workers are currently busy using the sidekiq_running_jobs metric.
Unfortunately we don't expose a metric for the concurrency of sidekiq.
The best value we have is the maximum concurrency configured in chef, but this is not very accurate and serves as an upper limit, rather than the actual value.
Proposal
If we expose the actual concurrency that each sidekiq worker process is using, we can then calculate a saturation metric for Sidekiq workers. We could then use this value for forecasting, alerting, etc.
This would be similar to the saturation metric we have for unicorn_workers (although this is currently broken, see #31494 (closed)).
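As a minimal sketch of what such a saturation value could look like, assuming it is defined as running jobs divided by total available concurrency: the function name, the clamping to 1.0, and the handling of zero capacity are assumptions made for illustration, not an existing implementation.

```python
def saturation(running_jobs: float, total_concurrency: float) -> float:
    """Fraction of available sidekiq worker threads that are busy (0.0 to 1.0)."""
    if total_concurrency <= 0:
        # No capacity reported: return NaN rather than dividing by zero.
        return float("nan")
    # Clamp at 1.0 in case the two values were sampled at slightly different times.
    return min(running_jobs / total_concurrency, 1.0)


# Example: 540 running jobs against a capacity of 720 threads
print(saturation(540, 720))  # 0.75
```

A value that stays close to 1.0 would suggest scaling the fleet up, while a value that stays low would suggest there is room to scale down.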
I propose we use sidekiq_concurrency{worker="1"} 10. Note that since we run multiple sidekiq processes, we need a dimension for which worker the value is coming from. We can then aggregate this value to the priority level to get the total available capacity for that priority.
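The aggregation could look roughly like the sketch below, which sums per-process concurrency values by priority. The sample data is made up, and the priority and node labels (and the realtime priority name) are assumptions: the proposal itself only specifies the worker label. In Prometheus the same result would normally come from summing the metric by the priority label.

```python
from collections import defaultdict

# Hypothetical scraped samples of sidekiq_concurrency: (labels, value).
# Only the "worker" label comes from the proposal; "priority" and "node" are assumed.
samples = [
    ({"priority": "besteffort", "node": "sidekiq-01", "worker": "1"}, 15),
    ({"priority": "besteffort", "node": "sidekiq-01", "worker": "2"}, 15),
    ({"priority": "besteffort", "node": "sidekiq-02", "worker": "1"}, 10),
    ({"priority": "realtime", "node": "sidekiq-03", "worker": "1"}, 10),
    ({"priority": "realtime", "node": "sidekiq-03", "worker": "2"}, 10),
]


def capacity_by_priority(samples):
    """Sum the per-process concurrency values for each priority."""
    totals = defaultdict(int)
    for labels, value in samples:
        totals[labels["priority"]] += value
    return dict(totals)


print(capacity_by_priority(samples))  # {'besteffort': 40, 'realtime': 20}
```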
In future, this metric would also be useful for autoscaling the sidekiq fleets.