Resumed concurrency limit jobs can exceed concurrency limit when Sidekiq queue is backlogged
Full context: gitlab-com/gl-infra/production-engineering#27871 (comment 2862190684) (internal)
We recently made changes to improve the throughput of resumed jobs (!206836 (merged)): instead of rescheduling a single batch of deferred jobs per ResumeWorker execution, we now reschedule as many batches as possible per execution. The motivation was to optimize ResumeWorker performance so that jobs in the concurrency limit queue don't accumulate indefinitely for workers with a high rate of incoming jobs.
While this has increased the ResumeWorker's throughput, in rare cases where there is a massive backlog in the Sidekiq queue, the number of concurrent jobs can exceed the configured concurrency limit. For some worker classes, too many concurrent jobs can severely impact downstream resources such as the database connection pool.
Example for `Security::SyncProjectPolicyWorker`:
- There are 0 concurrent jobs (`concurrent_worker_count`) for the worker.
- `ConcurrencyLimit::ResumeWorker` runs.
- The 1st call of `next_batch_from_queue` returns 200 (200 - 0).
- 200 jobs are re-enqueued from the concurrency limit queue to the Sidekiq queue.
- Because the Sidekiq queue itself is massively backlogged, these 200 jobs are still waiting at the back of the queue.
- `concurrent_worker_count` is still 0 because none of the 200 jobs have started executing yet.
- The 2nd call of `next_batch_from_queue` also returns 200, so another 200 jobs are enqueued in the Sidekiq queue.
- Note that we don't check resumed jobs against the concurrency limit (https://gitlab.com/gitlab-org/gitlab/blob/c2a3e812d7e0d5be410e88cd9ac056d43b9481e8/lib/gitlab/sidekiq_middleware/concurrency_limit/middleware.rb#L57 and https://gitlab.com/gitlab-org/gitlab/blob/c2a3e812d7e0d5be410e88cd9ac056d43b9481e8/lib/gitlab/sidekiq_middleware/concurrency_limit/middleware.rb#L65-65), so any number of resumed jobs in the Sidekiq queue could be executed concurrently.
- This repeats for up to 5 minutes or until the concurrency limit queue has been cleared.
It's worth reiterating that the above scenario only happens when the Sidekiq queue is backlogged.
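To make the compounding effect concrete, here is a simplified model of the loop described above. This is not the actual `QueueManager` code; the variable names and numbers are illustrative only.

```ruby
# Simplified model of the overshoot described above -- not the actual
# QueueManager implementation. In the real middleware, concurrent_worker_count
# is tracked in Redis; here it is a plain variable that stays at 0 because the
# re-enqueued jobs are stuck behind the Sidekiq backlog.
limit = 200
concurrent_worker_count = 0   # never increases while the queue is backlogged
deferred_queue_size = 1_000   # jobs waiting in the concurrency limit queue
resumed_total = 0

# Eager mode keeps pulling batches for up to ~5 minutes. Since none of the
# re-enqueued jobs have started, `limit - concurrent_worker_count` is always
# the full limit, so every iteration resumes another full batch.
5.times do
  batch_size = [limit - concurrent_worker_count, deferred_queue_size].min
  break if batch_size <= 0

  resumed_total += batch_size
  deferred_queue_size -= batch_size
end

puts resumed_total # => 1000 resumed jobs now sit in the Sidekiq queue, 5x the limit of 200
```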
✅ Temporary solution
Due to capacity constraints, we're only going to implement the following fix:
- Add an `ops` feature flag `concurrency_limit_eager_resume_processing` (default: false) to toggle between the previous mode (1 batch of jobs per execution) and eager mode (as many jobs as possible within 5 minutes). This allows us to enable the FF on .com, where we need the performance improvement, while self-managed and Dedicated won't be affected by this edge case. A rough sketch of the toggle follows this list.
  - If .com happens to face this issue again, we can disable the FF `concurrency_limit_eager_resume_processing`.
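A minimal sketch of how the flag could gate the two modes. The helper names (`resume_batch!`, `deferred_jobs_remaining?`) and method signature are assumptions for illustration, not the actual `QueueManager` interface.

```ruby
# Sketch only: helper names below are illustrative, not the real implementation.
def resume_processing!(deadline:)
  if Feature.enabled?(:concurrency_limit_eager_resume_processing, type: :ops)
    # Eager mode: keep resuming batches until the ~5 minute deadline
    # or until the concurrency limit queue is drained.
    resume_batch! while Time.current < deadline && deferred_jobs_remaining?
  else
    # Previous behaviour: resume a single batch per ResumeWorker execution.
    resume_batch!
  end
end
```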
💡 Long-term solution (not implemented yet)
Some ideas to fix this properly:
- Parallelize `ResumeWorker` (gitlab-com/gl-infra/production#20567 (comment 2819626193)).
  - This would help offset the performance loss from reverting !206836 (merged).
  - This should increase performance significantly, and we could re-enable the concurrency limit middleware for `WebHooks::LogExecutionWorker` and `AuditEvents::AuditEventStreamingWorker` (gitlab-com/gl-infra/production#20567 (comment 2849039988)).
- Perhaps we can also break this loop https://gitlab.com/gitlab-org/gitlab/blob/cf81e34c7147fbfe77e4f7d3b6f1725571adda78/lib/gitlab/sidekiq_middleware/concurrency_limit/queue_manager.rb#L42-50 by checking the queue size. If the queue size exceeds a certain number (e.g. `GITLAB_SIDEKIQ_MAX_REPLICAS * SIDEKIQ_CONCURRENCY * some_constant`), we break the loop and don't resume any more jobs, letting the next execution of `ResumeWorker` continue the resumption. A sketch of this guard follows the list.
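A rough sketch of the queue-size guard. `queue_size`, `resume_batch`, and `some_constant` are assumptions for illustration; only `next_batch_from_queue` and the two environment variables come from the existing code and the idea above.

```ruby
# Illustrative guard for the resume loop in QueueManager. The threshold mirrors
# the formula above; `some_constant` would need tuning, and `queue_size` stands
# for the destination Sidekiq queue's current length.
max_replicas     = Integer(ENV.fetch('GITLAB_SIDEKIQ_MAX_REPLICAS', 1))
concurrency      = Integer(ENV.fetch('SIDEKIQ_CONCURRENCY', 20))
some_constant    = 2
queue_size_limit = max_replicas * concurrency * some_constant

loop do
  # Stop resuming once the destination queue is already saturated; the next
  # ResumeWorker execution will continue the resumption.
  break if queue_size >= queue_size_limit

  batch = next_batch_from_queue
  break if batch.empty?

  resume_batch(batch)
end
```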