Race condition in gitlab:db:repair_index causes intermittent failures

Summary

The gitlab:db:repair_index rake task fails intermittently when repairing unique indexes due to a race condition between deduplication and reindexing. New duplicate records are inserted after duplicates are deleted but before the index is rebuilt, causing the reindex operation to fail with unique constraint violations.

Steps to reproduce

  1. Run the repair task on a database with duplicate records in pm_packages:

    gitlab-rake gitlab:db:repair_index
  2. Observe the failure:

    W, [2025-12-10T06:14:40.798978]  WARN -- : Found 100 duplicates in 'pm_packages' for columns: purl_type,name
    I, [2025-12-10T06:14:40.913994]  INFO -- : Deleted 100 duplicate records from pm_packages
    I, [2025-12-10T06:14:40.915533]  INFO -- : Reindexing index 'i_pm_packages_purl_type_and_name'...
    rake aborted!
    ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR:  could not create unique index "i_pm_packages_purl_type_and_name_ccnew11"
    DETAIL:  Key (purl_type, name)=(7, testpackage) is duplicated.
  3. Run the task again and observe:

    • Different number of duplicates reported
    • Different records failing
    • This confirms that new duplicates are being actively created while the repair runs

What is the current bug behavior?

The repair task exhibits a race condition:

Current problematic flow:

# lib/gitlab/database/repair_index.rb
def run
  indexes.each do |index_name, index_config|
    if index_config['unique']
      deduplicate_data(...)  # Deletes duplicates in a transaction
    end                      # ← Transaction commits, lock released
    
    # ← RACE CONDITION: New duplicates inserted here by background jobs
    
    if index_exists?(...)
      reindex_index(...)     # ← Fails due to new duplicates
    end
  end
end

Timeline:

T1: BEGIN TRANSACTION
T2:   LOCK TABLE pm_packages IN SHARE ROW EXCLUSIVE MODE
T3:   Find 100 duplicates
T4:   Update references in pm_package_versions
T5:   DELETE 100 duplicate records
T6: COMMIT  ← Lock released
T7: ← Background job inserts new duplicate (purl_type=7, name='testpackage')
T8: REINDEX INDEX CONCURRENTLY i_pm_packages_purl_type_and_name
T9: ERROR: could not create unique index - duplicate key

The gap between T6 (lock release) and T8 (reindex start) allows concurrent operations to insert new duplicates.
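
The timeline can be reproduced deterministically without a database. Below is a minimal pure-Ruby simulation (the table contents and the unique-index check are stand-ins for illustration, not GitLab code) in which a "background job" inserts a duplicate in the window between T6 and T8:

```ruby
require 'set'

# Simplified table contents, keyed on (purl_type, name).
rows = [[7, 'testpackage'], [7, 'testpackage'], [1, 'other']]

# T1-T6: deduplication runs in its own transaction and commits,
# releasing the table lock.
rows.uniq!

# T7: a background job inserts a fresh duplicate in the unprotected
# window between commit and reindex.
rows << [7, 'testpackage']

# T8-T9: rebuilding the unique index now fails on the new duplicate.
def rebuild_unique_index(rows)
  seen = Set.new
  rows.each do |key|
    unless seen.add?(key)
      raise "could not create unique index: key #{key.inspect} is duplicated"
    end
  end
  :ok
end

begin
  rebuild_unique_index(rows)
rescue RuntimeError => e
  puts e.message
end
```

Running the "background job" step before T6 instead (i.e., while the lock is still held) is exactly what the proposed fix prevents.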

What is the expected correct behavior?

The deduplication and reindexing should be atomic: no new duplicates should be able to appear between the two operations.

Root Cause Analysis

The issue occurs because:

  1. Deduplication runs in a transaction with SHARE ROW EXCLUSIVE MODE lock
  2. Transaction commits after deduplication, releasing the lock
  3. Background jobs (e.g., package metadata sync) can insert new duplicates
  4. Reindex operation starts and encounters the new duplicates
  5. Reindex fails with unique constraint violation

The current code structure separates these operations:

  • deduplicate_data() - runs in its own transaction
  • reindex_index() - runs separately after the transaction commits

Proposed Solution

Wrap both deduplication and reindexing in a single atomic transaction using non-concurrent index operations:

Key Changes

  1. Add non-concurrent SQL templates:

    • REINDEX_SQL_NON_CONCURRENT = "REINDEX INDEX %{index_name}"
    • CREATE_INDEX_SQL_NON_CONCURRENT = "CREATE%{unique_clause} INDEX %{index_name} ON %{table_name} (%{column_list})"
  2. Create new atomic method deduplicate_and_reindex:

    • Wraps deduplication and reindexing in single transaction
    • Maintains SHARE ROW EXCLUSIVE MODE lock throughout
    • Prevents race condition window
  3. Add helper methods:

    • reindex_index_non_concurrent - reindex without CONCURRENTLY
    • create_index_non_concurrent - create index without CONCURRENTLY
  4. Update main run method:

    • Calls deduplicate_and_reindex for unique indexes
    • Ensures atomic operation
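
The steps above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `LOCK_SQL` constant, the stub connection, and the simplified dedup statement are assumptions; only `deduplicate_and_reindex` and `REINDEX_SQL_NON_CONCURRENT` follow the names proposed above.

```ruby
REINDEX_SQL_NON_CONCURRENT = "REINDEX INDEX %{index_name}"
LOCK_SQL = "LOCK TABLE %{table_name} IN SHARE ROW EXCLUSIVE MODE"  # assumed name

# Stub connection that records executed SQL, so the flow can be
# inspected without a live PostgreSQL instance.
class StubConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  def transaction
    @statements << 'BEGIN'
    yield
    @statements << 'COMMIT'
  end

  def execute(sql)
    @statements << sql
  end
end

# Both steps run inside one transaction, so the SHARE ROW EXCLUSIVE
# lock is held from deduplication through the (non-concurrent) reindex
# and the T6-T8 race window never opens.
def deduplicate_and_reindex(connection, table_name, index_name)
  connection.transaction do
    connection.execute(format(LOCK_SQL, table_name: table_name))
    connection.execute("/* deduplicate #{table_name} (simplified) */")
    connection.execute(format(REINDEX_SQL_NON_CONCURRENT, index_name: index_name))
  end
end

conn = StubConnection.new
deduplicate_and_reindex(conn, 'pm_packages', 'i_pm_packages_purl_type_and_name')
conn.statements.each { |s| puts s }
```

The recorded statements show the lock, dedup, and reindex all sitting between one `BEGIN`/`COMMIT` pair.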

Why Non-Concurrent?

REINDEX INDEX CONCURRENTLY cannot run inside a transaction block (PostgreSQL limitation). Since we already hold a SHARE ROW EXCLUSIVE MODE lock that blocks writes, using non-concurrent operations is appropriate:

  • Writes are already blocked by our lock
  • We need to stay in the transaction to prevent race condition
  • Non-concurrent is faster and uses less disk space
  • The lock ensures atomicity
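
The template choice can be expressed as a small helper (the helper name and structure are illustrative assumptions, not existing code): since PostgreSQL rejects `REINDEX ... CONCURRENTLY` inside a transaction block, the plain form must be selected whenever the dedup transaction is open.

```ruby
REINDEX_SQL = "REINDEX INDEX CONCURRENTLY %{index_name}"
REINDEX_SQL_NON_CONCURRENT = "REINDEX INDEX %{index_name}"

# PostgreSQL rejects REINDEX ... CONCURRENTLY inside a transaction
# block, so while the dedup lock is held the plain form is required.
def reindex_sql_for(index_name, inside_transaction:)
  template = inside_transaction ? REINDEX_SQL_NON_CONCURRENT : REINDEX_SQL
  format(template, index_name: index_name)
end

puts reindex_sql_for('i_pm_packages_purl_type_and_name', inside_transaction: true)
```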

Affected Tables/Indexes

All unique indexes in INDEXES_TO_REPAIR:

  • pm_packages.i_pm_packages_purl_type_and_name
  • pm_affected_packages.i_affected_packages_unique_for_upsert
  • merge_request_diff_commit_users.index_merge_request_diff_commit_users_on_org_id_name_email
  • topics.index_topics_on_organization_id_and_name
  • ci_refs.index_ci_refs_on_project_id_and_ref_path
  • ci_resource_groups.index_ci_resource_groups_on_project_id_and_key
  • environments.index_environments_on_project_id_and_name
  • sbom_components.idx_sbom_components_on_name_purl_type_component_type_and_org_id
  • tags.index_tags_on_name
  • container_repositories.index_container_repositories_on_project_id_and_name
  • audit_events_group_external_streaming_destinations.unique_idx_group_destinations_on_name_category_group
  • audit_events_instance_external_streaming_destinations.unique_idx_instance_destinations_on_name_category

Impact Assessment

Severity: High - Prevents successful index repair on production systems

Frequency: Intermittent - depends on timing of background jobs

Workaround:

  1. Pause background jobs that create duplicates
  2. Run repair task
  3. Resume background jobs

Files to Change

  • lib/gitlab/database/repair_index.rb