Improve ConcurrencyLimit::ResumeWorker performance

Context

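ConcurrencyLimit (https://docs.gitlab.com/ee/development/sidekiq/worker_attributes.html#concurrency-limit) is a mechanism that limits how many jobs of a given worker can run concurrently at any point in time. Jobs that exceed the limit are deferred, and ConcurrencyLimit::ResumeWorker later re-enqueues them once capacity frees up.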
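A minimal sketch of the deferral behavior described above, in Python. `ConcurrencyLimiter`, `submit`, and `finish` are hypothetical names, not GitLab's API. The real limiter is assumed to share its state across Sidekiq processes, whereas this sketch keeps it in memory. A limit of 0 is treated as "no limit", matching how the limit is used later in this issue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ConcurrencyLimiter:
    """Hypothetical per-worker limiter (in-memory sketch, not GitLab's implementation)."""

    limit: int  # max concurrent jobs; 0 means "no limit", as described in this issue
    running: int = 0  # jobs currently executing
    deferred: deque = field(default_factory=deque)  # jobs parked until capacity frees up

    def has_capacity(self) -> bool:
        # A non-positive limit disables the check entirely.
        return self.limit <= 0 or self.running < self.limit

    def submit(self, job) -> bool:
        """Start the job if a slot is free, otherwise defer it. Returns True if started."""
        if self.has_capacity():
            self.running += 1
            return True
        self.deferred.append(job)
        return False

    def finish(self) -> None:
        """Release the slot held by a completed job."""
        self.running -= 1
```

For example, with `limit=2`, submitting five jobs starts the first two and defers the remaining three; those deferred jobs are exactly what the ResumeWorker has to drain later.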

During incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18834, Search::Zoekt::IndexingTaskWorker accumulated a backlog of 10 million jobs.


The concurrency limit was subsequently set to 0, which removed the previous limit of 100 so that the ResumeWorker could clear the backlog without being throttled. Even so, clearing the backlog took 2 days.
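For scale, a rough calculation from the figures above (treating 10 million jobs and 2 days as approximate values):

```latex
\text{rate} \approx \frac{10^{7}\ \text{jobs}}{2 \times 86\,400\ \text{s}}
= \frac{10^{7}}{172\,800}\ \text{jobs/s}
\approx 58\ \text{jobs/s} \approx 3\,500\ \text{jobs/min}
```

If deferred jobs ever arrive faster than the ResumeWorker can resume them, the backlog can only grow.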

This must be addressed before throttling is rolled out to all Sidekiq workers; otherwise we risk being permanently backlogged.

Proposal

We're currently improving the ConcurrencyLimit::ResumeWorker performance by:

  1. Parallelize ConcurrencyLimit::ResumeWorker (gitlab-org/gitlab#499807 - closed)
  2. Bulk enqueue for ConcurrencyLimit::ResumeWorker (gitlab-org/gitlab#503732 - closed)
  3. Optimize queries to WAL LSN on concurrency limiter (#18 - closed)
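The first two items can be illustrated with a small Python sketch. This is not the code from the linked issues: `resume_deferred`, `resume_all`, and the `enqueue_bulk` callback are hypothetical, and a thread pool with one task per worker class stands in for whatever fan-out the parallelized ResumeWorker actually uses. The point is that each batch costs one bulk enqueue call instead of one call per job, and that separate worker classes no longer wait on one another. The WAL LSN query optimization (item 3) is not modeled.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def resume_deferred(deferred: deque, capacity: int, batch_size: int,
                    enqueue_bulk: Callable[[List], None]) -> int:
    """Move up to `capacity` deferred jobs back onto the queue, `batch_size` at a time.

    Each batch costs a single `enqueue_bulk` call (bulk enqueue) instead of one
    call per job. Returns the number of jobs resumed.
    """
    resumed = 0
    while deferred and resumed < capacity:
        take = min(batch_size, capacity - resumed, len(deferred))
        batch = [deferred.popleft() for _ in range(take)]
        enqueue_bulk(batch)  # hypothetical bulk push: one round trip per batch
        resumed += take
    return resumed


def resume_all(queues: Dict[str, deque], capacity: int, batch_size: int,
               enqueue_bulk: Callable[[List], None],
               max_workers: int = 4) -> Dict[str, int]:
    """Resume each worker class's deferred queue in parallel, one task per worker class.

    Each deque is touched by exactly one task, so the queues need no extra locking.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(resume_deferred, queue, capacity, batch_size, enqueue_bulk)
            for name, queue in queues.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

For example, with 10 deferred jobs, 7 free slots, and a batch size of 3, `resume_deferred` issues three bulk calls of 3, 3, and 1 jobs and leaves 3 jobs deferred.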

Status 2025-02-05

With the improvements listed in the proposal above, we're seeing at least a 3x improvement in the rate at which jobs are cleared (see #9, comment 2332251637).
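As a back-of-the-envelope extrapolation only (assuming the 3x figure would hold for a backlog the size of the 10 million job incident, which has not been measured), the same backlog would clear in at most:

```latex
t_{\text{new}} \le \frac{t_{\text{old}}}{3} = \frac{48\ \text{h}}{3} = 16\ \text{h}
```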

Edited Feb 05, 2025 by Marco Gregorius