ConcurrencyLimit::ResumeWorker cron is EE-gated but concurrency limit middleware is not — CE instances permanently deadlock workers (#591124) · Issues · GitLab.org / GitLab


### Summary

`ConcurrencyLimit::ResumeWorker` cron registration is EE-gated (`Gitlab.ee do` block in `config/initializers/1_settings.rb`, line 850), but the concurrency limit middleware and `DEFAULT_CONCURRENCY_LIMIT_PERCENTAGE_BY_URGENCY` (introduced in MR !194881, milestone 18.3) are **not** EE-gated. Any CE instance with `GITLAB_SIDEKIQ_MAX_REPLICAS > 0` and `SIDEKIQ_CONCURRENCY > 0` (both set automatically by the Helm chart) permanently deadlocks workers using `deduplicate :until_executed`.

### GitLab version

18.9.0-ce (Helm chart deployment on Kubernetes, `gitlab-sidekiq-ce:v18.9.0`)

### What is the current bug behavior?

Workers with `deduplicate :until_executed` that exceed their computed concurrency limit are deferred into a Redis throttle queue by `ConcurrencyLimit::Server` middleware. Because `ConcurrencyLimit::ResumeWorker` is never registered as a cron job in CE, these deferred jobs are never resumed. The `until_executed` dedup cookie remains in Redis indefinitely (by design per MR !208142), causing all subsequent enqueue attempts for the same idempotency key to be silently dropped as duplicates.

**Observable symptoms:**

- CI job traces are never flushed from Redis to object storage (`Ci::BuildTraceChunkFlushWorker` deadlocked)
- The CI runner receives HTTP 202 ("accepted, but not yet completed") for 5 minutes on `PUT /api/v4/jobs/:id` until `ACCEPT_TIMEOUT` expires and the trace is discarded
- CI pipelines take ~14 minutes instead of ~3 minutes
- `sidekiq_client.log` shows repeated `job_status: deduplicated` / `deduplication.type: until executed` entries for `Ci::BuildTraceChunkFlushWorker`
- Redis key `sidekiq:concurrency_limit:throttled_jobs:{ci/build_trace_chunk_flush_worker}` accumulates jobs that are never drained

### What is the expected correct behavior?

`ConcurrencyLimit::ResumeWorker` should be registered as a cron job in CE, matching the EE behavior.
The cron runs every minute, checks all workers with jobs in throttle queues, and re-enqueues them with `concurrency_limit_resume: true` so the middleware doesn't re-defer them.

Alternatively, if concurrency limiting is not intended for CE, `DEFAULT_CONCURRENCY_LIMIT_PERCENTAGE_BY_URGENCY` and the `ConcurrencyLimit::Server`/`ConcurrencyLimit::Client` middleware should also be gated behind `Gitlab.ee`.

### Steps to reproduce

1. Deploy GitLab CE >= 18.3.0 via Helm chart (which sets `GITLAB_SIDEKIQ_MAX_REPLICAS=2` and `SIDEKIQ_CONCURRENCY=20`)
2. Run any CI pipeline that generates sufficient log output (e.g., phpstan static analysis)
3. Observe that `Ci::BuildTraceChunkFlushWorker` jobs are deferred by `ConcurrencyLimit::Server` into the throttle queue
4. Observe that no `ConcurrencyLimit::ResumeWorker` cron exists:

   ```ruby
   Sidekiq::Cron::Job.all.select { |j| j.name.include?('resume') }
   # => [] (empty on CE)
   ```

5. Observe the runner receiving HTTP 202 for ~5 minutes per job until `ACCEPT_TIMEOUT` fires

### Root cause analysis

In `config/initializers/1_settings.rb`, both `concurrency_limit_resume_worker` (line 953) and `pause_control_resume_worker` (line 950) are inside a `Gitlab.ee do` block (lines 850–1178). On CE instances, `Gitlab.ee?` returns `false`, so these cron jobs are never added to `Settings.cron_jobs` and never registered by `Gitlab::SidekiqConfig::CronJobInitializer.execute`.
However, the following components are NOT EE-gated:

- `DEFAULT_CONCURRENCY_LIMIT_PERCENTAGE_BY_URGENCY` in `app/workers/concerns/worker_attributes.rb` (MR !194881, milestone 18.3)
- `ConcurrencyLimit::Server` and `ConcurrencyLimit::Client` middleware in `lib/gitlab/sidekiq_middleware.rb`
- `get_concurrency_limit` / `calculate_default_limit_from_max_percentage`, which compute non-zero limits when `GITLAB_SIDEKIQ_MAX_REPLICAS > 0`

The Helm chart sets `GITLAB_SIDEKIQ_MAX_REPLICAS` to a non-zero value for both CE and EE deployments, activating concurrency limiting on CE without the corresponding drain mechanism.

### Impact

This affects **all 70+ workers** using `deduplicate :until_executed` on any CE instance deployed via Helm chart (or any CE instance where `GITLAB_SIDEKIQ_MAX_REPLICAS > 0`).

Most critical affected workers:

| Worker | Urgency | Impact when deadlocked |
|---|---|---|
| `Ci::BuildTraceChunkFlushWorker` | high | CI traces lost, pipelines slow by ~5min/job |
| `PipelineProcessWorker` | high | Pipelines hang permanently |
| `MergeWorker` | high | Merges silently dropped |
| `Ci::CancelPipelineWorker` | high | Cancel button does nothing |
| `Issues::CloseWorker` | high | Issues stay open after MR merge |
| `FlushCounterIncrementsWorker` | low (explicit 50% cap) | Project statistics corruption |
| `Import::ReassignPlaceholderUserRecordsWorker` | low (hardcoded limit=4) | Import migration stalls forever |
| `RunPipelineScheduleWorker` | low | Scheduled pipelines never trigger |

Workers with explicit `concurrency_limit` declarations (e.g., `Import::ReassignPlaceholderUserRecordsWorker` with `concurrency_limit -> { 4 }`) are affected regardless of `GITLAB_SIDEKIQ_MAX_REPLICAS`.
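
The interaction between deferral and the `until_executed` dedup cookie can be illustrated with a small toy model in plain Ruby. This is only a sketch of the behavior described in this issue, not GitLab's middleware: `ToyQueue`, `at_limit`, `enqueue`, and `resume` are hypothetical names, execution is treated as instantaneous, and the Redis throttle queue is replaced by an in-memory array.

```ruby
require 'set'

# Toy model only: all names here are hypothetical and do not exist in GitLab.
class ToyQueue
  attr_accessor :at_limit
  attr_reader :throttled, :executed, :dropped

  def initialize
    @at_limit = false     # true while the worker is at its concurrency limit
    @cookies = Set.new    # stand-in for until_executed dedup cookies
    @throttled = []       # stand-in for the Redis throttle queue
    @executed = []
    @dropped = []
  end

  # Client side: drop the job if a dedup cookie for the key already exists.
  def enqueue(key)
    unless @cookies.add?(key)
      @dropped << key
      return :deduplicated
    end
    process(key)
  end

  # Stand-in for the resume cron: drain deferred jobs, bypassing the limit.
  def resume
    process(@throttled.shift, resumed: true) until @throttled.empty?
  end

  private

  # Server side: defer when at the limit, otherwise execute.
  def process(key, resumed: false)
    if @at_limit && !resumed
      @throttled << key   # the dedup cookie stays set while the job waits
      return :deferred
    end
    @executed << key
    @cookies.delete(key)  # the cookie is cleared only after execution
    :executed
  end
end

queue = ToyQueue.new
queue.at_limit = true
puts queue.enqueue('chunk-1')  # deferred into the throttle queue
queue.at_limit = false
puts queue.enqueue('chunk-1')  # dropped as a duplicate despite free capacity
queue.resume                   # the step that never runs without the cron
puts queue.enqueue('chunk-1')  # accepted again once the cookie is cleared
```

In this model, nothing other than `resume` ever removes the cookie of a deferred job, which mirrors why a missing resume cron turns a temporary deferral into a permanent block for that idempotency key.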
### Workaround

Register the cron jobs manually via `rails runner` on a webservice pod:

```ruby
Sidekiq::Cron::Job.new(
  name: 'concurrency_limit_resume_worker',
  cron: '*/1 * * * *',
  class: 'ConcurrencyLimit::ResumeWorker'
).save

Sidekiq::Cron::Job.new(
  name: 'pause_control_resume_worker',
  cron: '*/5 * * * *',
  class: 'PauseControl::ResumeWorker'
).save
```

These persist across pod restarts because `sidekiq-cron`'s `destroy_removed_jobs` only removes jobs with `source: "schedule"`, and manually created jobs get `source: "dynamic"`.

Alternatively, set `GITLAB_SIDEKIQ_MAX_REPLICAS=0` on the Sidekiq pod to disable default concurrency limiting entirely (does not protect workers with explicit limits).

### Proposed fix

Move the `concurrency_limit_resume_worker` and `pause_control_resume_worker` cron registrations out of the `Gitlab.ee do` block in `config/initializers/1_settings.rb`, placing them alongside the other non-EE cron jobs.

### Relevant merge requests

- MR !194881 (milestone 18.3): introduced `DEFAULT_CONCURRENCY_LIMIT_PERCENTAGE_BY_URGENCY`, activating default concurrency limits for all workers
- MR !208142 (milestone 18.6): reordered middleware, making `ResumeWorker` essential for draining deferred jobs
- MR !211908 (milestone 18.6): removed env var gate, hardcoded the middleware ordering
- MR !174929 (milestone 17.7): original middleware reorder fix (superseded by !208142)

### Environment

- GitLab CE 18.9.0 via Helm chart on DigitalOcean Kubernetes
- `GITLAB_SIDEKIQ_MAX_REPLICAS=2`, `SIDEKIQ_CONCURRENCY=20`
- Sidekiq: 1 replica, concurrency=20
- Redis: single instance (`gitlab-redis-master-0`)

/cc @marcogreg @schin1
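
To confirm that the workaround took effect, or that a future fix registers both cron jobs on CE, the two job names from this issue can be compared against the registered cron names. The following is a minimal sketch: `missing_resume_crons` and `REQUIRED_RESUME_CRONS` are hypothetical helpers, not part of GitLab or `sidekiq-cron`, and the name lists are hard-coded so the snippet runs without a GitLab instance.

```ruby
# Cron job names taken from this issue's workaround.
REQUIRED_RESUME_CRONS = %w[
  concurrency_limit_resume_worker
  pause_control_resume_worker
].freeze

# Hypothetical helper: returns the required cron names that are absent
# from the given list of registered cron job names.
def missing_resume_crons(registered_names)
  REQUIRED_RESUME_CRONS - registered_names
end

# Hard-coded example inputs. On a real instance the names could instead be
# collected in a Rails console with Sidekiq::Cron::Job.all.map(&:name).
puts missing_resume_crons(%w[some_other_worker]).inspect
puts missing_resume_crons(REQUIRED_RESUME_CRONS + %w[some_other_worker]).inspect
```

An empty result means both resume crons are registered; any name in the result is still missing.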
