Limit ConcurrencyLimitSampler to current queue workers
What does this MR do and why?
Previously, any Sidekiq shard could take the exclusive lease and report metrics for all workers.
This made concurrency limit reporting inaccurate, because some workers rely on the GITLAB_SIDEKIQ_MAX_REPLICAS and SIDEKIQ_CONCURRENCY environment variables, which differ from shard to shard.
Now, the exclusive lease is scoped by shard (the queue), so each shard runs the sampler separately and samples only its own workers.
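The effect of scoping the lease can be sketched in isolation. This is a minimal illustration, not the sampler's actual code: `FakeLeaseStore`, `GLOBAL_KEY`, `shard_lease_key`, and `shards_that_sample` are hypothetical names, and the key formats are assumptions.

```ruby
# Hypothetical in-memory stand-in for an exclusive lease:
# at most one holder per key.
class FakeLeaseStore
  def initialize
    @held = {}
  end

  # Returns true if the lease was obtained, false if the key is already held.
  def try_obtain(key)
    return false if @held.key?(key)

    @held[key] = true
    true
  end
end

# Assumed key formats, for illustration only.
GLOBAL_KEY = 'concurrency_limit_sampler'

def shard_lease_key(queue)
  "concurrency_limit_sampler:#{queue}"
end

# Returns the queues whose sampler obtained a lease and would therefore report.
# The block maps a queue name to the lease key that shard would use.
def shards_that_sample(queues, store)
  queues.select { |queue| store.try_obtain(yield(queue)) }
end

queues = %w[cpu_bound default]

# One global key: only the first shard to ask gets the lease.
global = shards_that_sample(queues, FakeLeaseStore.new) { |_queue| GLOBAL_KEY }

# One key per queue: every shard gets its own lease.
scoped = shards_that_sample(queues, FakeLeaseStore.new) { |queue| shard_lease_key(queue) }

puts "global key: #{global.inspect}"
puts "scoped key: #{scoped.inspect}"
```

With a single key, whichever shard wins the lease reports for everyone using its own environment; with per-queue keys, each shard reports with the environment it actually runs under.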
Changelog: changed
References
Fixes gitlab-com/gl-infra/data-access/durability/team#269
How to set up and validate locally
- Set gitlab.yml as follows:

      ❯ cat config/gitlab.yml | yq '.production.sidekiq'
      log_format: json # (default is also supported)
      routing_rules: [["resource_boundary=cpu", "cpu_bound"], ["*", "default"]]

- Apply the following diff:

      diff --git a/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb b/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
      index 3cd0dec2a4e0..b113ef3805ab 100644
      --- a/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
      +++ b/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
      @@ -10,7 +10,7 @@ class ConcurrencyLimitSampler < BaseSampler
               # - Prometheus scrapes occur every 1 minute
               # - Our sampler lease lasts for 5 minutes
               # - After writing metrics, we sleep for 30s until lease expires before resetting the metrics to 0.
      -        DEFAULT_SAMPLING_INTERVAL_SECONDS = 30
      +        DEFAULT_SAMPLING_INTERVAL_SECONDS = 1
               LEASE_TIMEOUT = 300
       
               # The sleep ensures that:
      @@ -59,6 +59,7 @@ def lease_key
               end
       
               def report_metrics
      +          puts "workers size #{workers.size}"
                 workers.each do |w|
                   queue_size = concurrent_limit_service.queue_size(w.name)
                   report_queue_size(w, queue_size)

- Run sidekiq cluster as follows:

      $ bin/sidekiq-cluster cpu_bound

- You should see the workers size printed as 100.
- Run for the default queue:

      $ bin/sidekiq-cluster default

- The printed workers size should be around 847.
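The different counts follow from the routing rules: each worker lands in the queue of the first rule it matches, and a shard only samples the workers routed to its queue. A simplified sketch of that partitioning follows. The `Worker` struct, `queue_for`, `workers_for_shard`, and the example worker names are hypothetical, and the matcher handles only `*` and a single `attribute=value` term, which is an assumption and much narrower than the real routing query syntax.

```ruby
# Hypothetical, simplified worker description: only the attributes this sketch needs.
Worker = Struct.new(:name, :resource_boundary, keyword_init: true)

# The routing rules from the gitlab.yml shown in the steps above.
ROUTING_RULES = [
  ['resource_boundary=cpu', 'cpu_bound'],
  ['*', 'default']
].freeze

# Returns the queue of the first rule whose query matches the worker,
# or nil if no rule matches.
def queue_for(worker, rules = ROUTING_RULES)
  rules.each do |query, queue|
    return queue if query == '*'

    attribute, value = query.split('=', 2)
    return queue if worker[attribute].to_s == value
  end
  nil
end

# Selects the workers a shard listening on shard_queue would sample.
def workers_for_shard(workers, shard_queue)
  workers.select { |worker| queue_for(worker) == shard_queue }
end

workers = [
  Worker.new(name: 'ExampleCpuWorker', resource_boundary: 'cpu'),
  Worker.new(name: 'ExampleMailWorker', resource_boundary: 'unknown')
]

puts workers_for_shard(workers, 'cpu_bound').map(&:name).inspect
puts workers_for_shard(workers, 'default').map(&:name).inspect
```

Because the `*` rule comes last, every worker that is not CPU-bound falls through to `default`, which is why that shard reports the larger number.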
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.