Optimize ConcurrencyLimit::ResumeWorker performance
What does this MR do and why?
In gitlab-com/gl-infra/production#20567 (comment 2777839251), we noticed jobs accumulating indefinitely in the concurrency limit queue for some workers. This happens when the rate of incoming jobs exceeds the rate at which jobs are resumed, and we saw this backlog on the busiest workers.
This MR optimizes ConcurrencyLimit::ResumeWorker to resume as many jobs as it can within a single execution, instead of stopping after a fixed maximum of 5000 jobs per execution.
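The core idea is a loop that keeps resuming batches while there is both a backlog and spare concurrency, rather than returning after a single fixed-size batch. Below is a minimal, self-contained Ruby sketch of that loop; the class name, batch size, and counters are illustrative stand-ins, not the actual ResumeWorker implementation or its API.

```ruby
# A self-contained sketch (toy data, illustrative names) of the "drain until
# empty or at the limit" loop. This is NOT the actual ResumeWorker code.
class ResumeLoopSketch
  BATCH_SIZE = 1_000 # per-iteration batch size; the real value may differ

  def initialize(queue, concurrency_limit:, currently_running: 0)
    @queue = queue                          # Array standing in for the Redis-backed queue
    @concurrency_limit = concurrency_limit  # max jobs allowed to run at once
    @currently_running = currently_running  # jobs already running for this worker
  end

  # Keep resuming batches until the backlog is drained or the worker's
  # concurrency limit is saturated, instead of stopping after one fixed batch.
  def resume_all!
    loop do
      available = @concurrency_limit - @currently_running
      break if available <= 0 || @queue.empty?

      batch = @queue.shift([BATCH_SIZE, available].min)
      batch.each { |job| resume(job) }
      @currently_running += batch.size
    end
  end

  private

  def resume(job)
    # In GitLab this would re-enqueue the deferred job to its Sidekiq queue;
    # printing keeps the sketch runnable on its own.
    puts "resuming #{job}"
  end
end

ResumeLoopSketch.new(Array.new(25) { |i| "job-#{i}" }, concurrency_limit: 10).resume_all!
```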
References
Pseudocode from gitlab-com/gl-infra/production#20567 (comment 2779201333)
How to set up and validate locally
- Apply this diff:

  ```diff
  diff --git a/app/workers/chaos/sleep_worker.rb b/app/workers/chaos/sleep_worker.rb
  index 43b851a9f264..41403388ae2e 100644
  --- a/app/workers/chaos/sleep_worker.rb
  +++ b/app/workers/chaos/sleep_worker.rb
  @@ -9,6 +9,8 @@ class SleepWorker # rubocop:disable Scalability/IdempotentWorker
       sidekiq_options retry: 3
       include ChaosQueue

  +    concurrency_limit -> { 10 }
  +
       def perform(duration_s)
         Gitlab::Chaos.sleep(duration_s)
       end
  ```
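  The added `concurrency_limit -> { 10 }` caps `Chaos::SleepWorker` at 10 concurrent jobs, so anything enqueued beyond that is deferred to the concurrency limit queue, which makes it easy to build up a backlog locally.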
- Schedule a lot of jobs:

  ```ruby
  while true
    Chaos::SleepWorker.perform_async(1)
  end
  ```
- On a separate console, keep checking the queue size:

  ```ruby
  Gitlab::SidekiqMiddleware::ConcurrencyLimit::ConcurrencyLimitService.new("Chaos::SleepWorker").queue_size
  ```
- Once there are enough jobs in the queue, stop the loop in step 2.
- Check that the queue size decreases (see the snippet after this list for one way to watch it).
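One way to watch the queue drain is a small console loop built on the same `queue_size` call used above (the 5-second interval and 10 iterations are arbitrary):

```ruby
# Poll the concurrency limit queue size for Chaos::SleepWorker every few seconds.
service = Gitlab::SidekiqMiddleware::ConcurrencyLimit::ConcurrencyLimitService.new("Chaos::SleepWorker")

10.times do
  puts "#{Time.current} queue_size=#{service.queue_size}"
  sleep 5
end
```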
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.