Resumed concurrency limit jobs can exceed concurrency limit when Sidekiq queue is backlogged
Full context: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/27871#note_2862190684 (internal)

We recently changed how deferred jobs are resumed in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/206836: instead of rescheduling a single batch of deferred jobs per `ResumeWorker` execution, each execution now reschedules as many batches as it can. The motivation was to improve `ResumeWorker` throughput so that jobs don't accumulate indefinitely in the concurrency limit queue for workers with a high rate of incoming jobs.

While this has improved the `ResumeWorker`'s throughput, in rare cases where the Sidekiq queue itself has a massive backlog, the number of concurrent jobs can exceed the configured concurrency limit. For some worker classes, running too many concurrent jobs can severely impact downstream resources such as the database connection pool.

Example for `Security::SyncProjectPolicyWorker`:

* There are 0 concurrent jobs (`concurrent_worker_count`) for the worker.
* `ConcurrencyLimit::ResumeWorker` runs.
* The 1st call of [`next_batch_from_queue`](https://gitlab.com/gitlab-org/gitlab/blob/304d4c781978e027496ffe65af8e6fd50d04b719/lib/gitlab/sidekiq_middleware/concurrency_limit/queue_manager.rb#L43-43) returns 200 (`200 - 0`).
* 200 jobs are re-enqueued from the concurrency limit queue into the Sidekiq queue.
* **If the Sidekiq queue itself is massively backlogged**, these 200 jobs are still waiting at the back of the queue.
* `concurrent_worker_count` is still 0 because none of the 200 jobs have started executing yet.
* The 2nd call of `next_batch_from_queue` also returns 200.
* Another 200 jobs are enqueued into the Sidekiq queue.
* Note that resumed jobs are not checked against the concurrency limit (https://gitlab.com/gitlab-org/gitlab/blob/c2a3e812d7e0d5be410e88cd9ac056d43b9481e8/lib/gitlab/sidekiq_middleware/concurrency_limit/middleware.rb#L57-57 and https://gitlab.com/gitlab-org/gitlab/blob/c2a3e812d7e0d5be410e88cd9ac056d43b9481e8/lib/gitlab/sidekiq_middleware/concurrency_limit/middleware.rb#L65-65), so any number of resumed jobs in the Sidekiq queue can execute concurrently.
* This repeats for up to 5 minutes, or until the concurrency limit queue has been cleared.

It's worth reiterating that the above scenario only happens **when the Sidekiq queue is backlogged.**

### :white_check_mark: Temporary solution

Due to capacity constraints, we're only going to implement the following fix:

* Add an `ops` feature flag `concurrency_limit_eager_resume_processing` (defaulting to false) to toggle between the previous mode (1 batch of jobs per execution) and eager mode (as many batches as possible within 5 minutes). This allows us to enable the FF on .com, where we need the performance improvement, while self-managed and Dedicated won't be affected by this edge case.
* If .com faces this issue again, we can disable the FF `concurrency_limit_eager_resume_processing`.

### :bulb: Long term solution (not implemented yet)

Some ideas to fix this properly:

* Parallelize `ResumeWorker`: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20567#note_2819626193.
  * This would help offset the reduced performance from reverting https://gitlab.com/gitlab-org/gitlab/-/merge_requests/206836.
  * This should increase performance significantly, and we could re-enable the concurrency limit middleware for `WebHooks::LogExecutionWorker` and `AuditEvents::AuditEventStreamingWorker`: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20567#note_2849039988.
* We could also break the loop at https://gitlab.com/gitlab-org/gitlab/blob/cf81e34c7147fbfe77e4f7d3b6f1725571adda78/lib/gitlab/sidekiq_middleware/concurrency_limit/queue_manager.rb#L42-50 by checking the queue size: if the queue size exceeds a certain threshold (e.g. `GITLAB_SIDEKIQ_MAX_REPLICAS * SIDEKIQ_CONCURRENCY * some_constant`), break the loop and stop resuming jobs, letting the next execution of `ResumeWorker` continue the resumption.
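To illustrate the queue-size guard idea, here is a minimal, self-contained sketch. All names here (`FakeQueue`, `resume_until_capped`, `resume_batch!`) are hypothetical stand-ins for illustration, not the actual GitLab implementation in `queue_manager.rb`:

```ruby
# Minimal model: a "queue" tracking deferred jobs (concurrency limit
# queue) and jobs already sitting in the backlogged Sidekiq queue.
class FakeQueue
  attr_reader :sidekiq_size

  def initialize(deferred:, sidekiq_size:, batch_size:)
    @deferred = deferred          # jobs waiting in the concurrency limit queue
    @sidekiq_size = sidekiq_size  # jobs already in the Sidekiq queue
    @batch_size = batch_size
  end

  # Move up to batch_size jobs from the deferred set into the Sidekiq queue.
  def resume_batch!
    n = [@batch_size, @deferred].min
    @deferred -= n
    @sidekiq_size += n
    n
  end
end

# Proposed guard: stop resuming once the Sidekiq queue is backlogged past
# a cap (e.g. GITLAB_SIDEKIQ_MAX_REPLICAS * SIDEKIQ_CONCURRENCY * some_constant);
# the next ResumeWorker execution continues the resumption.
def resume_until_capped(queue, cap)
  resumed = 0
  loop do
    break if queue.sidekiq_size >= cap # queue already backlogged: stop
    n = queue.resume_batch!
    break if n.zero?                   # nothing left to resume
    resumed += n
  end
  resumed
end

queue = FakeQueue.new(deferred: 1_000, sidekiq_size: 0, batch_size: 200)
puts resume_until_capped(queue, 400) # => 400: stops after two batches
```

With the guard, the loop bounds how much work it can dump into an already backlogged Sidekiq queue in one pass, instead of draining the entire deferred set.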
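For reference, the over-resume loop in the scenario above can be modeled in a few lines. This is a simplified illustration (the constant and variable names are assumptions), not the actual `queue_manager.rb` code:

```ruby
# Each pass of the eager resume loop sizes its batch as
# limit - concurrent_worker_count. Because the backlogged Sidekiq queue
# has not started executing any re-enqueued job yet,
# concurrent_worker_count stays at 0 and every pass resumes a full batch.
LIMIT = 200
concurrent_worker_count = 0 # never increases while the queue is backlogged

resumed_total = 0
3.times do
  batch = LIMIT - concurrent_worker_count # 200 - 0 on every pass
  resumed_total += batch
end

puts resumed_total # => 600 jobs re-enqueued, 3x the concurrency limit
```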