Limit ConcurrencyLimitSampler to current queue workers

What does this MR do and why?

Previously, any Sidekiq shard could take the exclusive lease and report metrics for all workers.

This made reporting for the concurrency limit inaccurate, because some workers derive their limit from the GITLAB_SIDEKIQ_MAX_REPLICAS and SIDEKIQ_CONCURRENCY environment variables, which differ from shard to shard.

Now, the exclusive lease is scoped by shard (the queue), so each shard runs the sampler separately and samples only its own workers.
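
As a rough illustration of the scoping described above, here is a minimal sketch. It is not the code in this MR: `FakeLeaseStore`, `shard_lease_key`, and the key format are hypothetical stand-ins for the real exclusive lease, used only to show that a per-shard key lets each shard hold its own lease while still excluding a second sampler on the same shard.

```ruby
# Hypothetical in-memory stand-in for an exclusive lease store.
# The real sampler uses GitLab's exclusive lease; this only mimics
# the "first caller wins per key" behaviour.
class FakeLeaseStore
  def initialize
    @held = {}
  end

  # Returns true only for the first caller of a given key.
  def try_obtain(key)
    return false if @held.key?(key)

    @held[key] = true
    true
  end
end

# Hypothetical key format (an assumption): the shard name is part of the key.
def shard_lease_key(shard)
  "concurrency_limit_sampler:#{shard}"
end

store = FakeLeaseStore.new
results = [
  store.try_obtain(shard_lease_key('cpu_bound')), # first sampler on cpu_bound
  store.try_obtain(shard_lease_key('cpu_bound')), # second sampler on cpu_bound
  store.try_obtain(shard_lease_key('default'))    # first sampler on default
]
puts results.inspect # prints [true, false, true]
```

With a single global key, the third call would also have returned false, which corresponds to the old behaviour where one shard reported for everyone.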

Changelog: changed

References

Fixes gitlab-com/gl-infra/data-access/durability/team#269

How to set up and validate locally

  1. Set gitlab.yml as follows:

    ❯ cat config/gitlab.yml | yq '.production.sidekiq'
    log_format: json # (default is also supported)
    routing_rules: [["resource_boundary=cpu", "cpu_bound"], ["*", "default"]]
  2. Apply the following diff:

    diff --git a/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb b/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
    index 3cd0dec2a4e0..b113ef3805ab 100644
    --- a/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
    +++ b/lib/gitlab/metrics/samplers/concurrency_limit_sampler.rb
    @@ -10,7 +10,7 @@ class ConcurrencyLimitSampler < BaseSampler
             # - Prometheus scrapes occur every 1 minute
             # - Our sampler lease lasts for 5 minutes
             # - After writing metrics, we sleep for 30s until lease expires before resetting the metrics to 0.
    -        DEFAULT_SAMPLING_INTERVAL_SECONDS = 30
    +        DEFAULT_SAMPLING_INTERVAL_SECONDS = 1
             LEASE_TIMEOUT = 300
     
             # The sleep ensures that:
    @@ -59,6 +59,7 @@ def lease_key
             end
     
             def report_metrics
    +          puts "workers size #{workers.size}"
               workers.each do |w|
                 queue_size = concurrent_limit_service.queue_size(w.name)
                 report_queue_size(w, queue_size)
    
  3. Run sidekiq cluster as follows:

    $ bin/sidekiq-cluster cpu_bound
  4. You should see the workers size printed as 100.

  5. Run for the default queue

    $ bin/sidekiq-cluster default
  6. The printed workers size should be around 847.
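
The two counts differ because the routing rules from step 1 split the workers between the `cpu_bound` and `default` queues. The sketch below illustrates that split under simplifying assumptions: the `Worker` struct, the worker names, and the `matches?`, `queue_for`, and `workers_for_shard` helpers are hypothetical, and only `attribute=value` and `*` queries are handled, whereas the real routing query syntax is richer. Rules are evaluated in order and the first match wins.

```ruby
# Hypothetical worker model; real workers expose more attributes.
Worker = Struct.new(:name, :resource_boundary)

# Same rules as in step 1.
ROUTING_RULES = [['resource_boundary=cpu', 'cpu_bound'], ['*', 'default']].freeze

# Simplified matcher: supports only "*" and "attribute=value".
def matches?(query, worker)
  return true if query == '*'

  attribute, value = query.split('=', 2)
  worker[attribute].to_s == value
end

# First matching rule decides the queue.
def queue_for(worker, rules = ROUTING_RULES)
  _query, queue = rules.find { |query, _queue| matches?(query, worker) }
  queue
end

# Workers that a sampler running on the given shard would report on.
def workers_for_shard(workers, shard)
  workers.select { |w| queue_for(w) == shard }
end

workers = [
  Worker.new('HypotheticalCpuWorker', 'cpu'),
  Worker.new('HypotheticalMemoryWorker', 'memory'),
  Worker.new('HypotheticalOtherWorker', 'unknown')
]

puts workers_for_shard(workers, 'cpu_bound').size # prints 1
puts workers_for_shard(workers, 'default').size   # prints 2
```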

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Marco Gregorius
