Optimize ConcurrencyLimit::ResumeWorker performance

What does this MR do and why?

In gitlab-com/gl-infra/production#20567 (comment 2777839251), we noticed jobs accumulating indefinitely in the concurrency limit queue for some workers. This happens when the rate of incoming jobs exceeds the rate of resumed jobs, and we observed this backlog on the busiest workers.

This MR attempts to optimize ConcurrencyLimit::ResumeWorker by resuming as many jobs as it can within a single execution, instead of stopping at a fixed batch of 5000 jobs.
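The idea can be sketched with a small self-contained simulation. Note this is illustrative pseudocode under assumptions, not the actual GitLab implementation: `resume_all`, `BATCH_SIZE`, and the plain-array queue are hypothetical stand-ins for the real worker, which drains the backlog in batches until the queue is empty or the worker's concurrency limit is reached.

```ruby
# Illustrative sketch only: instead of resuming one fixed batch of 5000
# jobs per execution, keep pulling batches until the queue is drained or
# the remaining concurrency budget is used up.
BATCH_SIZE = 1_000 # arbitrary batch size for the sketch

def resume_all(queue, limit_remaining)
  resumed = 0
  until queue.empty? || resumed >= limit_remaining
    # Array#shift(n) removes and returns up to n leading elements
    batch = queue.shift([BATCH_SIZE, limit_remaining - resumed].min)
    resumed += batch.size
  end
  resumed
end

queue = (1..12_000).to_a
puts resume_all(queue, 100_000)
```

With a backlog of 12,000 jobs and sufficient budget, a single execution drains the whole queue rather than leaving 7,000 jobs for the next run.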

References

Pseudocode from gitlab-com/gl-infra/production#20567 (comment 2779201333)

How to set up and validate locally

  1. Apply this diff

    diff --git a/app/workers/chaos/sleep_worker.rb b/app/workers/chaos/sleep_worker.rb
    index 43b851a9f264..41403388ae2e 100644
    --- a/app/workers/chaos/sleep_worker.rb
    +++ b/app/workers/chaos/sleep_worker.rb
    @@ -9,6 +9,8 @@ class SleepWorker # rubocop:disable Scalability/IdempotentWorker
         sidekiq_options retry: 3
         include ChaosQueue
     
    +    concurrency_limit -> { 10 }
    +
         def perform(duration_s)
           Gitlab::Chaos.sleep(duration_s)
         end
    
  2. Schedule a lot of jobs

    while true
      Chaos::SleepWorker.perform_async(1)
    end
  3. On a separate console, keep checking the queue size

    Gitlab::SidekiqMiddleware::ConcurrencyLimit::ConcurrencyLimitService.new("Chaos::SleepWorker").queue_size
  4. Once there are enough jobs in the queue, stop the loop in step 2.

  5. Check that the queue size is decreasing.
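If everything is wired up correctly, step 5 behaves like the toy simulation below. This is a hedged illustration only: the backlog and per-tick numbers are arbitrary assumptions, not the worker's real limits.

```ruby
# Toy simulation of step 5: once the producer from step 2 stops, each
# ResumeWorker tick should only shrink the backlog until it hits zero.
queue_size = 50_000      # assumed starting backlog
resume_per_tick = 5_000  # assumed jobs resumed per tick
sizes = [queue_size]
until queue_size.zero?
  queue_size = [queue_size - resume_per_tick, 0].max
  sizes << queue_size
end
# the observed queue sizes should be strictly decreasing
puts sizes.each_cons(2).all? { |a, b| b < a }
```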

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Marco Gregorius
