Geo::BulkMarkPendingBatchWorker and BulkMarkVerificationPendingBatchWorker get stuck in an endless loop, causing stalled replication
Summary
Two similar issues are thought to have the same underlying cause, as both workers share common modules:
- The `Geo::BulkMarkPendingBatchWorker` can get stuck in an endless loop, causing job artifact replication to stall on the secondary site. The worker continues running indefinitely and blocks normal replication progress. This was triggered by the "Resync all" button in the UI.
- The `Geo::BulkMarkVerificationPendingBatchWorker` can get stuck in an endless loop, causing verification to stall on the secondary site. This is thought to have been triggered by the "Reverify All" button in the UI.
Symptoms
- Job artifact replication stalls or progresses very slowly
- The `Geo::BulkMarkPendingBatchWorker` continues running and does not complete
- Synced count may fluctuate or even decrease between data points
- A high volume of "exclusive lease" errors may be observed in the logs (relationship to root cause unclear):
  - Message: `Cannot obtain an exclusive lease. There must be another instance already in execution.`
  - Lease key: `geo_bulk_update_service:job_artifact_registry`
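To quantify the stall between data points, here is a minimal Rails console sketch for the secondary site; the `Geo::JobArtifactRegistry` scope names (`synced`, `failed`, `pending`) are assumptions about the registry model rather than commands taken from this issue:

```ruby
# Run in a Rails console (gitlab-rails console) on the secondary Geo site.
# Take two snapshots a few minutes apart: a synced count that stays flat or
# decreases while the worker keeps running matches the stalled behaviour above.
# NOTE: the scope names below are assumptions about the registry model.
def registry_snapshot
  {
    total:   Geo::JobArtifactRegistry.count,
    synced:  Geo::JobArtifactRegistry.synced.count,
    failed:  Geo::JobArtifactRegistry.failed.count,
    pending: Geo::JobArtifactRegistry.pending.count
  }
end

before = registry_snapshot
sleep 300 # five minutes between data points
after = registry_snapshot

puts "before: #{before.inspect}"
puts "after:  #{after.inspect}"
puts "synced delta: #{after[:synced] - before[:synced]}"
```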
Steps to Reproduce
Note: I've been unable to reproduce this locally with 100+ job artifacts by triggering a resync; the customer instance where this most recently occurred had 2.6 million job artifacts. More work is needed to investigate and reproduce this.
Observed Triggers
This issue has been observed in two different scenarios:
- First occurrence: Post-cutover while provisioning secondary (trigger unclear)
- Second occurrence: after the following sequence:
  - Connectivity loss between primary and secondary (48+ hours)
  - Manual "Resync all" operation triggered from the primary site's Admin area → Geo sites for CI Job Artifacts
  - Replication begins but then stalls
  - `Geo::BulkMarkPendingBatchWorker` jobs continue running indefinitely
  - In this case, the root caller was `GraphqlController#execute` (triggered from the UI)
Current Workaround
Enable the feature flag `drop_sidekiq_jobs_Geo::BulkMarkPendingBatchWorker` on the primary site to drop the stuck worker jobs:

`Feature.enable(:"drop_sidekiq_jobs_Geo::BulkMarkPendingBatchWorker")`
This allows normal Geo periodic workers to resume processing the replication queue.
The feature flag can be disabled after ~10 minutes once the stuck jobs have been dropped.
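For reference, a minimal Rails console sketch of the workaround flow, built around the `Feature.enable`/`Feature.disable` calls above; the `Sidekiq::Workers` check is an illustrative addition for confirming the stuck jobs have drained, not part of the documented workaround:

```ruby
# Run in a Rails console (gitlab-rails console) on the site where the stuck
# jobs are being processed.
flag = :"drop_sidekiq_jobs_Geo::BulkMarkPendingBatchWorker"

# 1. Start dropping the stuck worker jobs as Sidekiq picks them up.
Feature.enable(flag)

# 2. (Illustrative) Count busy Geo::BulkMarkPendingBatchWorker jobs; the
#    payload format differs between Sidekiq versions, so handle both.
require 'sidekiq/api'

busy = 0
Sidekiq::Workers.new.each do |_process_id, _thread_id, work|
  payload = work.respond_to?(:payload) ? work.payload : work['payload']
  payload = Sidekiq.load_json(payload) if payload.is_a?(String)
  busy += 1 if payload['class'] == 'Geo::BulkMarkPendingBatchWorker'
end
puts "Busy Geo::BulkMarkPendingBatchWorker jobs: #{busy}"

# 3. After ~10 minutes, once the stuck jobs have been dropped, turn the
#    flag back off so future bulk-mark jobs run normally.
Feature.disable(flag)
```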
Known Occurrences
This issue has been observed at least twice during GitLab Dedicated migrations:
- First occurrence - post-cutover while provisioning secondary
- Second occurrence - during active replication after connectivity loss and manual resync
Technical Context
The `Geo::BulkMarkPendingBatchWorker` marks registries as pending in batches so that they are resynchronized by the Geo periodic workers. The underlying service uses an exclusive lease to limit the jobs iterating over each registry table to one at a time, avoiding excessive database pressure and interference with the Redis cursor.
The worker appears to get stuck in an endless loop where it cannot complete its work, blocking normal replication progress.
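For illustration only, here is a simplified sketch of the exclusive-lease pattern described above, using the `Gitlab::ExclusiveLease` API; the class name, timeout, and batch logic are assumptions and do not mirror the actual service code:

```ruby
# Simplified sketch of the exclusive-lease pattern; class name, timeout, and
# batch logic are illustrative, not the real GitLab service implementation.
class BulkMarkPendingSketch
  LEASE_KEY = 'geo_bulk_update_service:job_artifact_registry'
  LEASE_TIMEOUT = 1.hour # assumed value for illustration

  def execute
    lease = Gitlab::ExclusiveLease.new(LEASE_KEY, timeout: LEASE_TIMEOUT)
    uuid = lease.try_obtain

    # If another job already holds the lease, log the "Cannot obtain an
    # exclusive lease" message and return without doing any work. This is
    # what keeps iteration over each registry table to one job at a time.
    return log_lease_taken unless uuid

    begin
      mark_next_batch_as_pending
    ensure
      # Release the lease so the next scheduled job can continue from the
      # Redis cursor position.
      Gitlab::ExclusiveLease.cancel(LEASE_KEY, uuid)
    end
  end

  private

  def log_lease_taken
    Rails.logger.info('Cannot obtain an exclusive lease. There must be another instance already in execution.')
  end

  def mark_next_batch_as_pending
    # Placeholder for the real work: load a batch of registry rows via a
    # Redis cursor and update their state to pending.
  end
end
```

Because only one holder can own the lease at a time, a high volume of lease messages is expected whenever bulk-update jobs overlap; as noted in the symptoms, their relationship to the root cause is unclear.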
Impact
- Geo replication stalls, preventing secondary sites from staying in sync
- Manual intervention required to resolve
- Potential data inconsistency between primary and secondary sites during the stalled period
Environment
- GitLab versions observed: 18.1.3, 18.2.7
- Affects: GitLab Dedicated (likely affects all Geo deployments)
- Component: Geo replication, specifically blob replication (Job Artifacts, potentially other blob types)
Related Issues
- Tracked in Geo issues meta: https://gitlab.com/gitlab-org/gitlab/-/issues/538825