Measure per-process sidekiq worker concurrency

Introduction

On GitLab.com, we have done a great deal of work simplifying and optimising the sidekiq configuration.

One of the effects of this change is that all the worker processes in a sidekiq cluster will process the same queues, and each queue will run in exactly one sidekiq cluster priority.

The new configuration is described in this file: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/blob/master/tools/sidekiq-config/sidekiq-queue-configurations.libsonnet

One of the advantages of this approach is that we can monitor each Sidekiq priority and determine how close to saturation the fleet is.

For example, suppose that:

The besteffort fleet runs on 6 nodes
Each node runs 8 sidekiq worker processes
Each sidekiq worker process has a concurrency of 15, meaning that 15 sidekiq jobs can be processed in parallel.

This means that, at any moment, up to 6 * 8 * 15 = 720 besteffort priority jobs can be running. If we hit this limit, jobs will start queueing up. If this happens for short periods, it's fine, but if it happens frequently, we probably need to scale the fleet up. Likewise, if we never reach the limit, we can probably scale the fleet down.
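The capacity arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and parameters are made up for this example, and the numbers are the ones from the besteffort example.

```python
def fleet_capacity(nodes: int, processes_per_node: int, concurrency: int) -> int:
    """Maximum number of jobs a fleet can run in parallel.

    Hypothetical helper for illustration; assumes every node runs the same
    number of worker processes and every process has the same concurrency.
    """
    return nodes * processes_per_node * concurrency


# The besteffort example: 6 nodes, 8 processes per node, concurrency of 15.
besteffort_capacity = fleet_capacity(nodes=6, processes_per_node=8, concurrency=15)
print(besteffort_capacity)  # 720
```

The uniformity assumption is exactly what breaks down in practice, which is why a per-process metric would be more accurate than a single configured value.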

We can calculate how many workers are currently busy using the sidekiq_running_jobs metric.

Unfortunately, we don't currently expose a metric for the concurrency of each sidekiq worker process.

The best value we have is the maximum concurrency configured in chef, but this is not very accurate and serves as an upper limit, rather than the actual value.

Proposal

If we expose the actual concurrency that each sidekiq worker process is using, we can then calculate a saturation metric for Sidekiq workers. We could then use this value for forecasting, alerting, etc.

This would be similar to the saturation metric we have for unicorn_workers (although this is currently broken, see #31494 (closed)).

I propose we expose this as a metric of the form sidekiq_concurrency{worker="1"} 10. Note that, since we run multiple sidekiq processes, we need a dimension for which worker the value is coming from. We can then aggregate this value to the priority level to get the total available capacity for that priority.
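As a rough sketch of how the aggregation could work, the Python below sums per-worker samples up to the priority level and divides busy jobs by available capacity. It makes several assumptions: samples are modelled as (labels, value) pairs, each sample carries a priority label (hypothetical, the proposal above only names the worker label), and the function names are invented for this example. In practice this would more likely be a recording rule in the monitoring system than application code.

```python
from collections import defaultdict


def sum_by_priority(samples):
    """Sum sample values per priority.

    `samples` is a list of (labels, value) pairs, where `labels` is a dict.
    The "priority" label is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        totals[labels["priority"]] += value
    return dict(totals)


def saturation_by_priority(running_samples, concurrency_samples):
    """Ratio of busy job slots to available slots, per priority."""
    busy = sum_by_priority(running_samples)
    capacity = sum_by_priority(concurrency_samples)
    return {
        priority: busy.get(priority, 0.0) / total
        for priority, total in capacity.items()
        if total > 0  # skip priorities that report no capacity
    }


# Two worker processes, each reporting sidekiq_concurrency = 10.
concurrency = [
    ({"priority": "besteffort", "worker": "1"}, 10),
    ({"priority": "besteffort", "worker": "2"}, 10),
]
# The same two processes reporting sidekiq_running_jobs.
running = [
    ({"priority": "besteffort", "worker": "1"}, 7),
    ({"priority": "besteffort", "worker": "2"}, 8),
]

print(saturation_by_priority(running, concurrency))  # {'besteffort': 0.75}
```

A value that stays close to 1.0 would suggest scaling the fleet up, while a value that stays low would suggest there is room to scale down.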

In future, this metric would also be useful for autoscaling the sidekiq fleets.

cc @ayufan @craig-gomes

Edited May 31, 2022 by 🤖 GitLab Bot 🤖