Improve ConcurrencyLimit::ResumeWorker performance
Context
ConcurrencyLimit (https://docs.gitlab.com/ee/development/sidekiq/worker_attributes.html#concurrency-limit) is a mechanism to limit the number of concurrent workers running at any point in time.
In the incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18834, the Search::Zoekt::IndexingTaskWorker accumulated a backlog of 10 million jobs.
The concurrency limit was subsequently set to 0, which let the ResumeWorker clear the backlog without being held to the previously set limit of 100. Even so, clearing the backlog took 2 days.
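As a rough back-of-the-envelope check, these figures imply an average drain rate of a little under 60 jobs per second. The calculation below assumes exactly 10 million jobs and exactly 48 hours, both of which are rounded figures from the incident, so treat the result as an order-of-magnitude estimate only.

```ruby
backlog_jobs  = 10_000_000        # approximate backlog size from the incident
drain_seconds = 2 * 24 * 60 * 60  # 2 days, expressed in seconds

# Average number of jobs resumed per second over the whole period.
average_rate = backlog_jobs.to_f / drain_seconds
puts format("average drain rate: %.1f jobs/s", average_rate)
```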
This is important to address before implementing throttling for all Sidekiq workers; otherwise we risk being constantly backlogged.
Proposal
We're currently improving the ConcurrencyLimit::ResumeWorker performance through the following issues:
- Parallelize ConcurrencyLimit::ResumeWorker (gitlab-org/gitlab#499807 - closed)
- Bulk enqueue for ConcurrencyLimit::ResumeWorker (gitlab-org/gitlab#503732 - closed)
- Optimize queries to WAL LSN on concurrency limiter (#18 - closed)
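To illustrate why the bulk enqueue item helps, here is a minimal, self-contained sketch. It is not the GitLab implementation: `FakeQueue`, `resume_one_by_one`, `resume_in_bulk`, and the batch size of 1,000 are hypothetical names and values chosen for illustration, and an in-memory array stands in for the real queue backend. It only shows how batching reduces the number of round trips needed to drain a backlog.

```ruby
# Hypothetical in-memory stand-in for the destination queue. Each call to
# `push` or `push_bulk` counts as one round trip to the queue backend.
class FakeQueue
  attr_reader :jobs, :round_trips

  def initialize
    @jobs = []
    @round_trips = 0
  end

  # Enqueue a single job: one round trip.
  def push(job)
    @round_trips += 1
    @jobs << job
  end

  # Enqueue a whole batch of jobs: still one round trip.
  def push_bulk(batch)
    @round_trips += 1
    @jobs.concat(batch)
  end
end

# Resume jobs one at a time: one round trip per job.
def resume_one_by_one(backlog, queue)
  queue.push(backlog.shift) until backlog.empty?
end

# Resume jobs in batches: one round trip per batch.
# The batch size is an assumed value, not taken from the source.
def resume_in_bulk(backlog, queue, batch_size: 1_000)
  queue.push_bulk(backlog.shift(batch_size)) until backlog.empty?
end

single = FakeQueue.new
resume_one_by_one((1..10_000).to_a, single)

bulk = FakeQueue.new
resume_in_bulk((1..10_000).to_a, bulk)

puts "one by one: #{single.round_trips} round trips"
puts "bulk:       #{bulk.round_trips} round trips"
```

In this toy model the same 10,000 jobs are resumed either way, but the bulk path needs 10 round trips instead of 10,000. Parallelizing the worker is a separate improvement and is not modeled here.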
Status 2025-02-05
With the improvements listed in the proposal above, we're seeing at least a 3x improvement in the rate of clearing jobs: #9 (comment 2332251637)
