Race condition in gitlab:db:repair_index causes intermittent failures

Summary

The gitlab:db:repair_index rake task fails intermittently when repairing unique indexes due to a race condition between deduplication and reindexing. New duplicate records are inserted after duplicates are deleted but before the index is rebuilt, causing the reindex operation to fail with unique constraint violations.

Steps to reproduce

  1. Run the repair task on a database with duplicate records in pm_packages:

    gitlab-rake gitlab:db:repair_index
  2. Observe the failure:

    W, [2025-12-10T06:14:40.798978]  WARN -- : Found 100 duplicates in 'pm_packages' for columns: purl_type,name
    I, [2025-12-10T06:14:40.913994]  INFO -- : Deleted 100 duplicate records from pm_packages
    I, [2025-12-10T06:14:40.915533]  INFO -- : Reindexing index 'i_pm_packages_purl_type_and_name'...
    rake aborted!
    ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR:  could not create unique index "i_pm_packages_purl_type_and_name_ccnew11"
    DETAIL:  Key (purl_type, name)=(7, testpackage) is duplicated.
  3. Run the task again and observe:

    • Different number of duplicates reported
    • Different records failing
    • This confirms that new duplicates are being actively created while the repair runs

What is the current bug behavior?

The repair task exhibits a race condition:

Current problematic flow:

# lib/gitlab/database/repair_index.rb
def run
  indexes.each do |index_name, index_config|
    if index_config['unique']
      deduplicate_data(...)  # Deletes duplicates in a transaction
    end                      # ← Transaction commits, lock released
    
    # ← RACE CONDITION: New duplicates inserted here by background jobs
    
    if index_exists?(...)
      reindex_index(...)     # ← Fails due to new duplicates
    end
  end
end

Timeline:

T1: BEGIN TRANSACTION
T2:   LOCK TABLE pm_packages IN SHARE ROW EXCLUSIVE MODE
T3:   Find 100 duplicates
T4:   Update references in pm_package_versions
T5:   DELETE 100 duplicate records
T6: COMMIT  ← Lock released
T7: ← Background job inserts new duplicate (purl_type=7, name='testpackage')
T8: REINDEX INDEX CONCURRENTLY i_pm_packages_purl_type_and_name
T9: ERROR: could not create unique index - duplicate key

The gap between T6 (lock release) and T8 (reindex start) allows concurrent operations to insert new duplicates.
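
The timeline can be reproduced deterministically without a database. Below is a minimal pure-Ruby simulation (the table contents and the unique-index check are stand-ins for illustration, not GitLab code) in which a "background job" inserts a duplicate in the window between T6 and T8:

```ruby
require 'set'

# Simplified table contents, keyed on (purl_type, name).
rows = [[7, 'testpackage'], [7, 'testpackage'], [1, 'other']]

# T1-T6: deduplication runs in its own transaction and commits,
# releasing the table lock.
rows.uniq!

# T7: a background job inserts a fresh duplicate in the unprotected
# window between commit and reindex.
rows << [7, 'testpackage']

# T8-T9: rebuilding the unique index now fails on the new duplicate.
def rebuild_unique_index(rows)
  seen = Set.new
  rows.each do |key|
    unless seen.add?(key)
      raise "could not create unique index: key #{key.inspect} is duplicated"
    end
  end
  :ok
end

begin
  rebuild_unique_index(rows)
rescue RuntimeError => e
  puts e.message
end
```

Running the "background job" step before T6 instead (i.e., while the lock is still held) is exactly what the proposed fix prevents.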

What is the expected correct behavior?

The deduplication and reindexing should be atomic: no new duplicates should be able to appear between the two operations.

Root Cause Analysis

The issue occurs because:

  1. Deduplication runs in a transaction with SHARE ROW EXCLUSIVE MODE lock
  2. Transaction commits after deduplication, releasing the lock
  3. Background jobs (e.g., package metadata sync) can insert new duplicates
  4. Reindex operation starts and encounters the new duplicates
  5. Reindex fails with unique constraint violation

The current code structure separates these operations:

  • deduplicate_data() - runs in its own transaction
  • reindex_index() - runs separately after the transaction commits

Proposed Solution

Wrap both deduplication and reindexing in a single atomic transaction using non-concurrent index operations:

Key Changes

  1. Add non-concurrent SQL templates:

    • REINDEX_SQL_NON_CONCURRENT = "REINDEX INDEX %{index_name}"
    • CREATE_INDEX_SQL_NON_CONCURRENT = "CREATE%{unique_clause} INDEX %{index_name} ON %{table_name} (%{column_list})"
  2. Create new atomic method deduplicate_and_reindex:

    • Wraps deduplication and reindexing in single transaction
    • Maintains SHARE ROW EXCLUSIVE MODE lock throughout
    • Prevents race condition window
  3. Add helper methods:

    • reindex_index_non_concurrent - reindex without CONCURRENTLY
    • create_index_non_concurrent - create index without CONCURRENTLY
  4. Update main run method:

    • Calls deduplicate_and_reindex for unique indexes
    • Ensures atomic operation
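
The steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `LOCK_SQL` constant, the stub connection, and the simplified dedup statement are assumptions; only `deduplicate_and_reindex` and `REINDEX_SQL_NON_CONCURRENT` follow the names proposed above.

```ruby
REINDEX_SQL_NON_CONCURRENT = "REINDEX INDEX %{index_name}"
LOCK_SQL = "LOCK TABLE %{table_name} IN SHARE ROW EXCLUSIVE MODE"  # assumed name

# Stub connection that records executed SQL, so the flow can be
# inspected without a live PostgreSQL instance.
class StubConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  def transaction
    @statements << 'BEGIN'
    yield
    @statements << 'COMMIT'
  end

  def execute(sql)
    @statements << sql
  end
end

# Both steps run inside one transaction, so the SHARE ROW EXCLUSIVE
# lock is held from deduplication through the (non-concurrent) reindex
# and the T6-T8 race window never opens.
def deduplicate_and_reindex(connection, table_name, index_name)
  connection.transaction do
    connection.execute(format(LOCK_SQL, table_name: table_name))
    connection.execute("/* deduplicate #{table_name} (simplified) */")
    connection.execute(format(REINDEX_SQL_NON_CONCURRENT, index_name: index_name))
  end
end

conn = StubConnection.new
deduplicate_and_reindex(conn, 'pm_packages', 'i_pm_packages_purl_type_and_name')
conn.statements.each { |s| puts s }
```

The recorded statements show the lock, dedup, and reindex all sitting between one `BEGIN`/`COMMIT` pair.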

Why Non-Concurrent?

REINDEX INDEX CONCURRENTLY cannot run inside a transaction block (PostgreSQL limitation). Since we already hold a SHARE ROW EXCLUSIVE MODE lock that blocks writes, using non-concurrent operations is appropriate:

  • Writes are already blocked by our lock
  • We need to stay in the transaction to prevent race condition
  • Non-concurrent is faster and uses less disk space
  • The lock ensures atomicity
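
The template choice can be expressed as a small helper (the helper name and structure are illustrative assumptions, not existing code): since PostgreSQL rejects `REINDEX ... CONCURRENTLY` inside a transaction block, the plain form must be selected whenever the dedup transaction is open.

```ruby
REINDEX_SQL = "REINDEX INDEX CONCURRENTLY %{index_name}"
REINDEX_SQL_NON_CONCURRENT = "REINDEX INDEX %{index_name}"

# PostgreSQL rejects REINDEX ... CONCURRENTLY inside a transaction
# block, so while the dedup lock is held the plain form is required.
def reindex_sql_for(index_name, inside_transaction:)
  template = inside_transaction ? REINDEX_SQL_NON_CONCURRENT : REINDEX_SQL
  format(template, index_name: index_name)
end

puts reindex_sql_for('i_pm_packages_purl_type_and_name', inside_transaction: true)
```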

Affected Tables/Indexes

All unique indexes in INDEXES_TO_REPAIR:

  • pm_packages.i_pm_packages_purl_type_and_name
  • pm_affected_packages.i_affected_packages_unique_for_upsert
  • merge_request_diff_commit_users.index_merge_request_diff_commit_users_on_org_id_name_email
  • topics.index_topics_on_organization_id_and_name
  • ci_refs.index_ci_refs_on_project_id_and_ref_path
  • ci_resource_groups.index_ci_resource_groups_on_project_id_and_key
  • environments.index_environments_on_project_id_and_name
  • sbom_components.idx_sbom_components_on_name_purl_type_component_type_and_org_id
  • tags.index_tags_on_name
  • container_repositories.index_container_repositories_on_project_id_and_name
  • audit_events_group_external_streaming_destinations.unique_idx_group_destinations_on_name_category_group
  • audit_events_instance_external_streaming_destinations.unique_idx_instance_destinations_on_name_category

Impact Assessment

Severity: High - Prevents successful index repair on production systems

Frequency: Intermittent - depends on timing of background jobs

Workaround:

  1. Pause background jobs that create duplicates
  2. Run repair task
  3. Resume background jobs

Files to Change

  • lib/gitlab/database/repair_index.rb