Geo secondary disables concurrency_limit_resume_worker, leaving throttled Geo::EventWorker / Geo::SyncWorker jobs undrained
## Summary

`ConcurrencyLimit::ResumeWorker` is the component that drains Sidekiq concurrency-limit throttled job lists by re-enqueuing deferred work. We have a known related issue in #591124 / !225235, where missing cron registration caused deferred jobs to stall indefinitely; the fix was to move the `concurrency_limit_resume_worker` cron config to CE because the middleware is not EE-only.

This customer appears to hit a Geo-secondary-specific variant: on the affected Geo secondary, the Geo cron config watcher flips `concurrency_limit_resume_worker` back to disabled, while Geo worker jobs continue to accumulate in Redis throttled lists.

### Current bug behavior

On the Geo secondary:

- Large Redis throttled-job lists accumulated for:
  - `sidekiq:concurrency_limit:throttled_jobs:{geo/event_worker}`
  - `sidekiq:concurrency_limit:throttled_jobs:{geo/sync_worker}`
- A manual call to `Gitlab::SidekiqMiddleware::ConcurrencyLimit::ConcurrencyLimitService.resume_processing!('Geo::EventWorker')` removed jobs from the list in batches of 5000.
- After the manual run, the list did not initially continue draining on its own.
- `concurrency_limit_resume_worker` is defined as a 1-minute cron for `ConcurrencyLimit::ResumeWorker`.
- On the secondary, the Geo cron config watcher flipped that cron back to disabled.
- Once `ConcurrencyLimit::ResumeWorker` was temporarily enabled, the throttled Geo lists began draining; Geo caught up to 100%, and Redis memory started decreasing.

### Expected behavior

If Geo workers use concurrency-limit throttling on a Geo secondary, there should be a reliable mechanism to automatically resume/dequeue those throttled jobs. The Geo secondary should not leave the `Geo::EventWorker` / `Geo::SyncWorker` throttled lists effectively stranded because the resume cron is disabled.

### Why this looks related to #591124

Issue #591124 documents the general failure mode: if `ConcurrencyLimit::ResumeWorker` is not registered/scheduled, deferred jobs remain in the Redis throttle queue and are never resumed.
This case appears to be the Geo-secondary equivalent: the resume path exists and works manually, but the secondary's cron management disables the recurring worker needed to keep the throttled lists draining.

### Suspected root cause

Geo secondaries disable most Sidekiq-Cron jobs, and `concurrency_limit_resume_worker` appears not to be allowed to remain enabled there, even though Geo workers can still accumulate concurrency-limit throttled jobs on the secondary.

### Impact

- Geo replication can stall or lag while throttled lists accumulate.
- Redis memory can grow substantially while those lists build up.
- Manual intervention may be required to resume processing until the system re-enters a draining state.

### Reproduction notes

We do not yet have a clean standalone reproduction, but field evidence shows:

1. Geo throttled lists accumulate on the secondary for `Geo::EventWorker` / `Geo::SyncWorker`.
1. A manual `resume_processing!` drains list entries in batches of 5000.
1. The secondary's cron watcher flips `concurrency_limit_resume_worker` back to disabled.
1. Temporarily enabling the resume worker causes the lists to start draining again, after which Geo catches up and Redis memory decreases.

### Proposed fix direction

One of these likely needs to be true:

- allow `concurrency_limit_resume_worker` to run on Geo secondaries when concurrency-limit throttled Geo jobs are possible, or
- provide an equivalent Geo-secondary-safe resume mechanism for throttled lists, or
- ensure Geo worker throttling cannot accumulate on secondaries without an automatic drain path.

### Related

- #591124 `ConcurrencyLimit::ResumeWorker` cron is EE-gated but concurrency limit middleware is not.
- !225235 Move `ConcurrencyLimit::ResumeWorker` cron config to CE.
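To make the failure mode concrete, here is a minimal, self-contained sketch of the batch-drain behavior described above. A plain in-memory array stands in for the Redis throttled-job list, `BATCH_SIZE` mirrors the observed 5000-job batches, and the method names (`resume_batch`, `drain`) are illustrative only, not GitLab's actual implementation:

```ruby
# Hypothetical sketch: a plain Array stands in for the Redis list
# sidekiq:concurrency_limit:throttled_jobs:{geo/event_worker}.
BATCH_SIZE = 5000

# Drain up to one batch from the throttled list, re-enqueuing each job.
# Returns the number of jobs resumed in this pass.
def resume_batch(throttled_list, &reenqueue)
  batch = throttled_list.shift(BATCH_SIZE)
  batch.each { |job| reenqueue.call(job) }
  batch.size
end

# A recurring resume worker repeatedly drains batches until the list is
# empty -- the role ConcurrencyLimit::ResumeWorker plays via its 1-minute
# cron. If that cron is disabled (as on the affected Geo secondary),
# nothing invokes this path and the throttled list grows unbounded.
def drain(throttled_list)
  resumed = 0
  loop do
    n = resume_batch(throttled_list) { |_job| } # re-enqueue stub
    resumed += n
    break if n.zero?
  end
  resumed
end

list = Array.new(12_000) { |i| { worker: 'Geo::EventWorker', args: [i] } }
drain(list) # drains in batches of 5000, 5000, 2000
```

The point of the sketch is that draining is pull-based: jobs only leave the throttled list when something calls the resume path, so disabling the cron removes the only automatic trigger.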