
Geo::BulkMarkPendingBatchWorker and BulkMarkVerificationPendingBatchWorker get stuck in an endless loop, causing stalled replication


Summary

Two similar issues are thought to have the same underlying cause, as both workers rely on shared modules:

  • The Geo::BulkMarkPendingBatchWorker can get stuck in an endless loop, causing job artifact replication to stall on the secondary site. The worker continues running indefinitely and blocks normal replication progress. This was triggered by the "Resync all" button in the UI.
  • The Geo::BulkMarkVerificationPendingBatchWorker can get stuck in an endless loop, causing verification to stall on the secondary site. This is thought to have been triggered by the "Reverify All" button in the UI.

Symptoms

  • Job artifacts replication progress stalls or progresses very slowly
  • The Geo::BulkMarkPendingBatchWorker continues running and does not complete
  • Synced count may fluctuate or even decrease between data points (see the console check sketched after this list)
  • High volume of "exclusive lease" errors may be observed in logs (relationship to root cause unclear):
    • Message: Cannot obtain an exclusive lease. There must be another instance already in execution.
    • Lease key: geo_bulk_update_service:job_artifact_registry
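
Where a Rails console is available on the affected secondary site, a quick check along the following lines can confirm these symptoms. This is a hedged sketch: the registry scope names and the ExclusiveLease helper are assumed from standard GitLab internals and may differ between versions.

# Run in a Rails console on the affected secondary site.
# Sample the counts a few times; in the stalled state the synced count barely
# moves, or even decreases, between samples.
puts "pending: #{Geo::JobArtifactRegistry.pending.count}"
puts "synced:  #{Geo::JobArtifactRegistry.synced.count}"

# Check whether the bulk update service's exclusive lease is currently held
# (a non-false return value means some job is holding it).
lease_key = 'geo_bulk_update_service:job_artifact_registry'
puts "lease holder: #{Gitlab::ExclusiveLease.get_uuid(lease_key).inspect}"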

Steps to Reproduce

Note: I've been unable to reproduce this locally by resyncing 100+ job artifacts; the customer instance where this most recently occurred had 2.6 million job artifacts. More work is needed to investigate and reproduce this.

Observed Triggers

This issue has been observed in two different scenarios:

  1. First occurrence: Post-cutover while provisioning secondary (trigger unclear)
  2. Second occurrence: After the following sequence:
    • Connectivity loss between primary and secondary (48+ hours)
    • Manual "Resync all" operation triggered from primary site's Admin area → Geo sites for CI Job Artifacts
    • Replication begins but then stalls
    • Geo::BulkMarkPendingBatchWorker jobs continue running indefinitely
    • In this case, root caller was GraphqlController#execute (triggered from UI)

Current Workaround

Enable the feature flag drop_sidekiq_jobs_Geo::BulkMarkPendingBatchWorker on the primary site (for example, from a Rails console) to drop the stuck worker jobs:

Feature.enable(:"drop_sidekiq_jobs_Geo::BulkMarkPendingBatchWorker")

This allows normal Geo periodic workers to resume processing the replication queue.

The feature flag can be disabled after ~10 minutes once the stuck jobs have been dropped.
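
Once the stuck jobs have been dropped, the flag can be turned back off from the same Rails console, for example:

# Re-disable the flag after the stuck Geo::BulkMarkPendingBatchWorker jobs have
# been dropped (roughly 10 minutes in the observed cases).
Feature.disable(:"drop_sidekiq_jobs_Geo::BulkMarkPendingBatchWorker")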

Known Occurrences

This issue has been observed at least twice during GitLab Dedicated migrations:

  1. First occurrence - post-cutover while provisioning secondary
  2. Second occurrence - during active replication after connectivity loss and manual resync

Technical Context

The Geo::BulkMarkPendingBatchWorker marks registries as pending in batches to be resynchronized by Geo periodic workers. The service uses an exclusive lease to limit concurrent jobs that iterate over each registry table to 1, avoiding excessive database pressure and Redis cursor interference.
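
For illustration only, the lease pattern described above looks roughly like the sketch below. This is not the actual Geo service code: the timeout value and the log_error/mark_one_batch_as_pending helpers are assumptions, while the lease key matches the one seen in the logs.

# Illustrative sketch of the exclusive-lease pattern, not the real implementation.
LEASE_KEY = 'geo_bulk_update_service:job_artifact_registry'
LEASE_TIMEOUT = 1.hour # assumed value

def bulk_mark_pending_batch
  uuid = Gitlab::ExclusiveLease.new(LEASE_KEY, timeout: LEASE_TIMEOUT).try_obtain

  unless uuid
    # Another job is already iterating this registry table; log and retry later.
    log_error('Cannot obtain an exclusive lease. There must be another instance already in execution.')
    return
  end

  begin
    mark_one_batch_as_pending # hypothetical helper that marks one batch of registry rows as pending
  ensure
    Gitlab::ExclusiveLease.cancel(LEASE_KEY, uuid)
  end
end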

The worker appears to get stuck in an endless loop where it cannot complete its work, blocking normal replication progress.

Impact

  • Geo replication stalls, preventing secondary sites from staying in sync
  • Manual intervention required to resolve
  • Potential data inconsistency between primary and secondary sites during the stalled period

Environment

  • GitLab versions observed: 18.1.3, 18.2.7
  • Affects: GitLab Dedicated (likely affects all Geo deployments)
  • Component: Geo replication, specifically blob replication (Job Artifacts, potentially other blob types)

Related Issues
