Resumed concurrency limit jobs can exceed concurrency limit when Sidekiq queue is backlogged
Full context: gitlab-com/gl-infra/production-engineering#27871 (comment 2862190684) (internal)
We recently made changes to improve the throughput of resumed jobs (!206836 (merged)): instead of rescheduling a single batch of deferred jobs per ResumeWorker execution, we now reschedule as many batches as possible per execution. The motivation was to optimize ResumeWorker performance so that jobs in the concurrency limit queue don't accumulate indefinitely for workers with a high rate of incoming jobs.
While this has increased the ResumeWorker's throughput, in rare cases where there is a massive backlog in the Sidekiq queue, the number of concurrent jobs can exceed the configured concurrency limit. For some worker classes, too many concurrent jobs can severely impact downstream resources such as the database connection pool.
Example for `Security::SyncProjectPolicyWorker`:
- There are 0 concurrent jobs (`concurrent_worker_count`) for the worker.
- `ConcurrencyLimit::ResumeWorker` runs.
- The 1st call of `next_batch_from_queue` returns 200 (200 - 0).
- 200 jobs are re-enqueued from the concurrency limit queue to the Sidekiq queue.
- Because the Sidekiq queue itself is massively backlogged, these 200 jobs are still waiting at the back of the queue.
- `concurrent_worker_count` is still 0 because none of the 200 jobs have started executing yet.
- The 2nd call of `next_batch_from_queue` also returns 200, so another 200 jobs are enqueued in the Sidekiq queue.
- Note that we don't check resumed jobs against the concurrency limit (https://gitlab.com/gitlab-org/gitlab/blob/c2a3e812d7e0d5be410e88cd9ac056d43b9481e8/lib/gitlab/sidekiq_middleware/concurrency_limit/middleware.rb#L57 and https://gitlab.com/gitlab-org/gitlab/blob/c2a3e812d7e0d5be410e88cd9ac056d43b9481e8/lib/gitlab/sidekiq_middleware/concurrency_limit/middleware.rb#L65-65), so any number of resumed jobs in the Sidekiq queue could be executed concurrently.
- This repeats for up to 5 minutes or until the concurrency limit queue has been cleared.
It's worth reiterating that the above scenario only happens when the Sidekiq queue is backlogged.
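To make the compounding effect concrete, here is a simplified model of the loop described above. This is not the actual `QueueManager` code; the variable names and numbers are illustrative only.

```ruby
# Simplified model of the overshoot described above -- not the actual
# QueueManager implementation. In the real middleware, concurrent_worker_count
# is tracked in Redis; here it is a plain variable that stays at 0 because the
# re-enqueued jobs are stuck behind the Sidekiq backlog.
limit = 200
concurrent_worker_count = 0   # never increases while the queue is backlogged
deferred_queue_size = 1_000   # jobs waiting in the concurrency limit queue
resumed_total = 0

# Eager mode keeps pulling batches for up to ~5 minutes. Since none of the
# re-enqueued jobs have started, `limit - concurrent_worker_count` is always
# the full limit, so every iteration resumes another full batch.
5.times do
  batch_size = [limit - concurrent_worker_count, deferred_queue_size].min
  break if batch_size <= 0

  resumed_total += batch_size
  deferred_queue_size -= batch_size
end

puts resumed_total # => 1000 resumed jobs now sit in the Sidekiq queue, 5x the limit of 200
```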
✅ Temporary solution
Due to capacity constraints, we're only going to implement the following fix:
- Add an `ops` feature flag `concurrency_limit_eager_resume_processing` (default: false) to toggle between the previous mode (1 batch of jobs per execution) and eager mode (as many jobs as possible within 5 minutes). This allows us to enable the FF on .com, where we need the performance improvement, while self-managed and Dedicated won't be affected by this edge case. A rough sketch of the toggle follows this list.
  - If .com happens to face this issue again, we can disable the FF `concurrency_limit_eager_resume_processing`.
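A minimal sketch of how the flag could gate the two modes. The helper names (`resume_batch!`, `deferred_jobs_remaining?`) and method signature are assumptions for illustration, not the actual `QueueManager` interface.

```ruby
# Sketch only: helper names below are illustrative, not the real implementation.
def resume_processing!(deadline:)
  if Feature.enabled?(:concurrency_limit_eager_resume_processing, type: :ops)
    # Eager mode: keep resuming batches until the ~5 minute deadline
    # or until the concurrency limit queue is drained.
    resume_batch! while Time.current < deadline && deferred_jobs_remaining?
  else
    # Previous behaviour: resume a single batch per ResumeWorker execution.
    resume_batch!
  end
end
```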
💡 Long-term solution (not implemented yet)
Some ideas to fix this properly:
- Parallelize `ResumeWorker` (gitlab-com/gl-infra/production#20567 (comment 2819626193)).
  - This would help offset the performance loss from reverting !206836 (merged).
  - This should increase performance significantly, and we could re-enable the concurrency limit middleware for `WebHooks::LogExecutionWorker` and `AuditEvents::AuditEventStreamingWorker` (gitlab-com/gl-infra/production#20567 (comment 2849039988)).
- Perhaps we can also break this loop https://gitlab.com/gitlab-org/gitlab/blob/cf81e34c7147fbfe77e4f7d3b6f1725571adda78/lib/gitlab/sidekiq_middleware/concurrency_limit/queue_manager.rb#L42-50 by checking the queue size. If the queue size exceeds a certain number (e.g. `GITLAB_SIDEKIQ_MAX_REPLICAS * SIDEKIQ_CONCURRENCY * some_constant`), we break the loop and don't resume any more jobs, letting the next execution of `ResumeWorker` continue the resumption. A sketch of this guard follows the list.
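A rough sketch of the queue-size guard. `queue_size`, `resume_batch`, and `some_constant` are assumptions for illustration; only `next_batch_from_queue` and the two environment variables come from the existing code and the idea above.

```ruby
# Illustrative guard for the resume loop in QueueManager. The threshold mirrors
# the formula above; `some_constant` would need tuning, and `queue_size` stands
# for the destination Sidekiq queue's current length.
max_replicas     = Integer(ENV.fetch('GITLAB_SIDEKIQ_MAX_REPLICAS', 1))
concurrency      = Integer(ENV.fetch('SIDEKIQ_CONCURRENCY', 20))
some_constant    = 2
queue_size_limit = max_replicas * concurrency * some_constant

loop do
  # Stop resuming once the destination queue is already saturated; the next
  # ResumeWorker execution will continue the resumption.
  break if queue_size >= queue_size_limit

  batch = next_batch_from_queue
  break if batch.empty?

  resume_batch(batch)
end
```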