Add FF to eagerly resume jobs

What does this MR do and why?

As described in #579350, there is an edge case: when the Sidekiq queue is massively backlogged, the number of concurrently running jobs for a worker can far exceed its configured concurrency limit.

With this change, ResumeWorker resumes one batch of jobs at a time by default. Each execution of ResumeWorker resumes at most as many jobs as the concurrency limit allows.

The main purpose of this MR is to protect self-managed and Dedicated instances from the edge case of "resumed jobs can exceed concurrency limit when Sidekiq queue is backlogged", which could put a lot of pressure on the database.

With the FF concurrency_limit_eager_resume_processing enabled, ResumeWorker tries to resume as many batches of jobs as possible within 5 minutes. This FF will be enabled for GitLab.com, where the performance of resuming jobs is especially important.

Self-managed and Dedicated instances go back to resuming 1 batch of jobs per ResumeWorker execution, which was the behavior before !206836 (merged).
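The difference between the two modes can be illustrated with a minimal sketch. This is not the code in this MR: `ResumeSketch`, `MAX_RUNTIME_S`, and the in-memory array standing in for the throttled-job queue are all hypothetical names chosen for illustration. It assumes that a batch is sized by the concurrency limit and that eager mode keeps going until the queue is empty or the 5-minute budget is used up.

```ruby
# Hypothetical sketch of the two resume modes; not GitLab's actual implementation.
class ResumeSketch
  MAX_RUNTIME_S = 5 * 60 # assumed time budget for eager mode

  # queue: Array standing in for the throttled jobs of one worker class
  # limit: the worker's concurrency limit, used here as the batch size
  # eager: stands in for the concurrency_limit_eager_resume_processing FF
  # clock: callable returning seconds; injectable so the budget can be tested
  def initialize(queue:, limit:, eager:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @queue = queue
    @limit = limit
    @eager = eager
    @clock = clock
  end

  # Returns the number of jobs resumed in this execution.
  def run
    started_at = @clock.call
    resumed = 0

    loop do
      batch = @queue.shift(@limit) # take at most `limit` jobs
      break if batch.empty?

      resumed += batch.size

      break unless @eager # default: exactly one batch per execution
      break if @clock.call - started_at >= MAX_RUNTIME_S # eager: stop at the time budget
    end

    resumed
  end
end
```

With 25 queued jobs and a limit of 10, the default mode resumes 10 jobs and leaves 15 for later executions, while eager mode drains all 25 in one execution as long as it stays within the time budget.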

References

#579350 (comment 2868082620)

How to set up and validate locally

  1. Apply this diff

    diff --git a/app/workers/chaos/sleep_worker.rb b/app/workers/chaos/sleep_worker.rb
    index 43b851a9f264..41403388ae2e 100644
    --- a/app/workers/chaos/sleep_worker.rb
    +++ b/app/workers/chaos/sleep_worker.rb
    @@ -9,6 +9,8 @@ class SleepWorker # rubocop:disable Scalability/IdempotentWorker
         sidekiq_options retry: 3
         include ChaosQueue
     
    +    concurrency_limit -> { 10 }
    +
         def perform(duration_s)
           Gitlab::Chaos.sleep(duration_s)
         end
    
  2. Schedule a lot of jobs

    while true
      Chaos::SleepWorker.perform_async(1)
    end

  3. On a separate console, keep checking the queue size

    Gitlab::SidekiqMiddleware::ConcurrencyLimit::ConcurrencyLimitService.new("Chaos::SleepWorker").queue_size

  4. Once there are enough jobs in the queue, stop the loop in step 2.

  5. Check that the queue size starts decreasing slowly after a while (this may take up to 1 minute because ResumeWorker is a per-minute cron job).
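To avoid re-running the queue size check by hand in steps 3 and 5, a small console helper along these lines could be used. `sample_queue_size` is a hypothetical helper, not part of GitLab. It calls whatever block it is given, so it makes no assumptions about the service beyond the `queue_size` call already shown in step 3.

```ruby
# Hypothetical helper: take `samples` readings, `interval_s` seconds apart,
# print each one with a timestamp, and return all readings as an array.
def sample_queue_size(samples:, interval_s:)
  Array.new(samples) do |i|
    sleep(interval_s) if i.positive? # no wait before the first reading
    size = yield
    puts "#{Time.now.strftime('%H:%M:%S')} queue_size=#{size}"
    size
  end
end

# Usage in a Rails console (uses the service call from step 3):
#   service = Gitlab::SidekiqMiddleware::ConcurrencyLimit::ConcurrencyLimitService.new("Chaos::SleepWorker")
#   sample_queue_size(samples: 12, interval_s: 10) { service.queue_size }
```

Sampling over a couple of minutes covers at least one ResumeWorker run, so the readings should show the queue size dropping.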

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Marco Gregorius
