Fix ConcurrencyLimitSampler reporting inconsistent limits

Context

This was discovered in gitlab-org/gitlab#553604 (comment 2731909956), after the use_max_concurrency_limit_percentage_as_default_limit FF is enabled.

The feature flag is to set default concurrency limit on all workers based on the worker's urgency the shard's available threads #215 (closed).

Problem

On the worker below, the Max limit should be a flat line, since it's a static limit.

We should be expecting the max limit to be a flat line of 75, based on 25% (low urgency) * 10 (catchall concurrency) * 30 (maxReplicas)

image.png

https://dashboards.gitlab.net/goto/8pfJITrHg?orgId=1

So, turns out this is due to the concurrency limit sampler can run from any shard, and the GITLAB_SIDEKIQ_MAX_REPLICAS would also change depending on the shard:

image.png

src

When catchall (the actual shard that the worker resides) reports the metrics, it correctly reports as 75, but if low-urgency-cpu-bound happens to report the metrics, it'd report as different number.

The way that ConcurrencyLimitSampler works is 1 process holding exclusive lease lock (amongst all other processes across all shards) and emit metrics above for all workers. For limits specifically, we also read GITLAB_SIDEKIQ_MAX_REPLICAS env var to calculate the max limit. Therefore different shards would report different limits, regardless of the worker itself.

Potential Solution

Instead of 1 process reporting metrics for all shards, we'll have 1 process per shard emitting the metrics, something like:

diff --git a/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb b/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
index 4c959a2bfe9b..7a578cfce854 100644
--- a/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
+++ b/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
@@ -54,6 +54,10 @@ def lease_timeout
           LEASE_TIMEOUT
         end
 
+        def lease_key
+          "#{self.class.name.underscore}:shard:#{current_queues}"
+        end
+
         def report_metrics
           workers.each do |w|
             queue_size = concurrent_limit_service.queue_size(w.name)
@@ -80,7 +84,11 @@ def reset_metrics
         end
 
         def workers
-          Gitlab::SidekiqConfig.workers_without_default.map(&:klass)
+          Gitlab::SidekiqConfig.current_worker_queue_mappings.keys
+        end
+
+        def current_queues
+          Sidekiq.default_configuration.queues
         end
 
         def concurrent_limit_service
Edited by Marco Gregorius