Resumed concurrency limit jobs can exceed the concurrency limit when the Sidekiq queue is backlogged


Full context: gitlab-com/gl-infra/production-engineering#27871 (comment 2862190684) (internal)

We recently changed how deferred jobs are resumed (!206836, merged) to improve throughput: instead of rescheduling a single batch of deferred jobs per ResumeWorker execution, each execution now reschedules as many batches as it can within its time budget. The motivation was to improve ResumeWorker performance so that jobs in the concurrency limit queue don't accumulate indefinitely for workers with a high rate of incoming jobs.
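For illustration, a minimal sketch of the difference between the two modes, assuming a hypothetical `resume_batch!` helper and the 5-minute processing budget mentioned below (the real implementation lives in !206836):

```ruby
# Illustrative sketch only -- not the actual ResumeWorker code.
# `workers_with_deferred_jobs` and `resume_batch!` are hypothetical helpers;
# `resume_batch!` is assumed to return the number of jobs it rescheduled.

RESUME_PROCESSING_BUDGET_SECONDS = 5 * 60

# Previous mode: reschedule a single batch of deferred jobs per execution.
def resume_single_batch
  workers_with_deferred_jobs.each { |worker| resume_batch!(worker) }
end

# Eager mode: keep rescheduling batches until the time budget is spent
# or there is nothing left to resume.
def resume_batches_until_deadline
  deadline = Time.now + RESUME_PROCESSING_BUDGET_SECONDS

  loop do
    resumed_count = workers_with_deferred_jobs.sum { |worker| resume_batch!(worker) }
    break if resumed_count.zero? || Time.now >= deadline
  end
end
```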

While this has increased the ResumeWorker's throughput, in rare cases where there is a massive backlog in the Sidekiq queue, the number of concurrent jobs can exceed the configured concurrency limit. For some worker classes, running too many jobs concurrently can severely impact downstream resources such as the database connection pool.
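One plausible reading of the race, sketched under the assumption that free slots are computed from the number of jobs currently executing (the helper names below are hypothetical, not GitLab's actual API):

```ruby
# Illustrative sketch only -- helper names are hypothetical.
# Free slots are assumed to be derived from jobs that are currently executing.
def resumable_slots(worker)
  worker_concurrency_limit(worker) - currently_executing_count(worker)
end

# When the Sidekiq queue is backlogged, jobs resumed by an earlier
# ResumeWorker run are still waiting in the queue, so they are not counted
# as executing. Each later run therefore still sees (nearly) a full set of
# free slots and resumes another batch. Once the backlog drains, all of
# those batches start at roughly the same time, and the number of jobs
# running concurrently exceeds the configured limit.
```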

Example for Security::SyncProjectPolicyWorker:

It's worth reiterating that the above scenario only happens when the Sidekiq queue is backlogged.

Temporary solution

Due to capacity constraints, we're only going to implement the following fix:

  • Add an ops feature flag concurrency_limit_eager_resume_processing (defaults to false) to toggle between the previous mode (one batch of jobs per execution) and the eager mode (as many batches as possible within 5 minutes); see the sketch after this list. This lets us enable the flag on .com, where we need the performance improvement, while self-managed and Dedicated instances won't be affected by this edge case.
    • If .com runs into this issue again, we can disable the concurrency_limit_eager_resume_processing flag.
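
A minimal sketch of how the toggle could be wired inside ResumeWorker. Only the flag name is taken from this issue; the method names and the exact call site are assumptions for illustration (they reuse the hypothetical helpers from the sketch above):

```ruby
# Illustrative sketch only -- only the flag name comes from this issue.
class ResumeWorker
  def perform
    if Feature.enabled?(:concurrency_limit_eager_resume_processing, type: :ops)
      # Eager mode: resume as many batches as possible within the time budget.
      resume_batches_until_deadline
    else
      # Previous mode: resume a single batch per execution.
      resume_single_batch
    end
  end
end
```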

💡 Long-term solution (not implemented yet)

Some ideas to fix this properly:
