Race condition in gitlab:db:repair_index causes intermittent failures
## Summary
The gitlab:db:repair_index rake task fails intermittently when repairing unique indexes due to a race condition between deduplication and reindexing. New duplicate records are inserted after duplicates are deleted but before the index is rebuilt, causing the reindex operation to fail with unique constraint violations.
## Steps to reproduce

1. Run the repair task on a database with duplicate records in `pm_packages`:

   ```shell
   gitlab-rake gitlab:db:repair_index
   ```

2. Observe the failure:

   ```
   W, [2025-12-10T06:14:40.798978] WARN -- : Found 100 duplicates in 'pm_packages' for columns: purl_type,name
   I, [2025-12-10T06:14:40.913994] INFO -- : Deleted 100 duplicate records from pm_packages
   I, [2025-12-10T06:14:40.915533] INFO -- : Reindexing index 'i_pm_packages_purl_type_and_name'...
   rake aborted!
   ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: could not create unique index "i_pm_packages_purl_type_and_name_ccnew11"
   DETAIL: Key (purl_type, name)=(7, testpackage) is duplicated.
   ```

3. Run the task again and observe:
   - A different number of duplicates reported
   - Different records failing

   This confirms that duplicates are being actively created during the repair.
## What is the current bug behavior?

The repair task exhibits a race condition. The current problematic flow:
```ruby
# lib/gitlab/database/repair_index.rb
def run
  indexes.each do |index_name, index_config|
    if index_config['unique']
      deduplicate_data(...) # Deletes duplicates in a transaction
    end                     # ← Transaction commits, lock released

    # ← RACE CONDITION: new duplicates inserted here by background jobs

    if index_exists?(...)
      reindex_index(...) # ← Fails due to new duplicates
    end
  end
end
```
Timeline:

```
T1: BEGIN TRANSACTION
T2: LOCK TABLE pm_packages IN SHARE ROW EXCLUSIVE MODE
T3: Find 100 duplicates
T4: Update references in pm_package_versions
T5: DELETE 100 duplicate records
T6: COMMIT ← lock released
T7: ← Background job inserts a new duplicate (purl_type=7, name='testpackage')
T8: REINDEX INDEX CONCURRENTLY i_pm_packages_purl_type_and_name
T9: ERROR: could not create unique index - duplicate key
```
The gap between T6 (lock release) and T8 (reindex start) allows concurrent operations to insert new duplicates.
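The race does not need PostgreSQL to demonstrate. A pure-Ruby sketch (the `UniqueTable` class below is illustrative, not GitLab code) models the same T1-T9 sequence: deduplication under a lock, the lock released, a "background job" re-inserting the duplicate, and the unique reindex failing:

```ruby
# Minimal in-memory model of the race: dedupe under a lock, release it,
# then a concurrent writer inserts a duplicate before the reindex runs.
class UniqueTable
  def initialize(rows)
    @rows = rows
    @lock = Mutex.new
  end

  # T1-T6: delete duplicates while holding the lock, then release it.
  def deduplicate!
    @lock.synchronize { @rows.uniq! }
  end

  # T7: a concurrent writer sneaks in after the lock is released.
  def insert(row)
    @lock.synchronize { @rows << row }
  end

  # T8-T9: rebuilding a unique index fails if duplicates reappeared.
  def reindex_unique!
    @lock.synchronize do
      raise "duplicate key" unless @rows.uniq.length == @rows.length
    end
  end
end

table = UniqueTable.new([%w[7 testpackage], %w[7 testpackage], %w[8 other]])
table.deduplicate!              # duplicates removed, lock released
table.insert(%w[7 testpackage]) # background job re-creates the duplicate
begin
  table.reindex_unique!
rescue RuntimeError => e
  puts "reindex failed: #{e.message}" # → reindex failed: duplicate key
end
```

The failure disappears only if the insert cannot happen between `deduplicate!` and `reindex_unique!`, which is exactly what the proposed fix enforces at the database level.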
## What is the expected correct behavior?

Deduplication and reindexing should be atomic: no duplicates should be able to appear between these operations.
## Root Cause Analysis

The issue occurs because:

1. Deduplication runs in a transaction with a `SHARE ROW EXCLUSIVE MODE` lock
2. The transaction commits after deduplication, releasing the lock
3. Background jobs (e.g., package metadata sync) can insert new duplicates
4. The reindex operation starts and encounters the new duplicates
5. The reindex fails with a unique constraint violation

The current code structure separates these operations:

- `deduplicate_data()` - runs in its own transaction
- `reindex_index()` - runs separately, after the transaction commits
## Proposed Solution

Wrap both deduplication and reindexing in a single atomic transaction using non-concurrent index operations.

### Key Changes

1. Add non-concurrent SQL templates:

   ```ruby
   REINDEX_SQL_NON_CONCURRENT = "REINDEX INDEX %{index_name}"
   CREATE_INDEX_SQL_NON_CONCURRENT = "CREATE%{unique_clause} INDEX %{index_name} ON %{table_name} (%{column_list})"
   ```

2. Create a new atomic method `deduplicate_and_reindex`:
   - Wraps deduplication and reindexing in a single transaction
   - Holds the `SHARE ROW EXCLUSIVE MODE` lock throughout
   - Closes the race condition window

3. Add helper methods:
   - `reindex_index_non_concurrent` - reindex without `CONCURRENTLY`
   - `create_index_non_concurrent` - create index without `CONCURRENTLY`

4. Update the main `run` method:
   - Calls `deduplicate_and_reindex` for unique indexes
   - Ensures the operation is atomic
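The non-concurrent templates can be exercised on their own. A small sketch of how they might be rendered (the helper names `reindex_sql_for` and `create_index_sql_for` are hypothetical, not the actual implementation):

```ruby
# Non-concurrent SQL templates from the proposed change, plus small
# formatting helpers that render them for a given index.
REINDEX_SQL_NON_CONCURRENT = "REINDEX INDEX %{index_name}"
CREATE_INDEX_SQL_NON_CONCURRENT =
  "CREATE%{unique_clause} INDEX %{index_name} ON %{table_name} (%{column_list})"

def reindex_sql_for(index_name)
  format(REINDEX_SQL_NON_CONCURRENT, index_name: index_name)
end

def create_index_sql_for(index_name:, table_name:, columns:, unique: false)
  format(CREATE_INDEX_SQL_NON_CONCURRENT,
         unique_clause: unique ? " UNIQUE" : "",
         index_name: index_name,
         table_name: table_name,
         column_list: columns.join(", "))
end

puts reindex_sql_for("i_pm_packages_purl_type_and_name")
# → REINDEX INDEX i_pm_packages_purl_type_and_name
puts create_index_sql_for(index_name: "i_pm_packages_purl_type_and_name",
                          table_name: "pm_packages",
                          columns: %w[purl_type name],
                          unique: true)
# → CREATE UNIQUE INDEX i_pm_packages_purl_type_and_name ON pm_packages (purl_type, name)
```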
### Why Non-Concurrent?

`REINDEX INDEX CONCURRENTLY` cannot run inside a transaction block (a PostgreSQL limitation). Since we already hold a `SHARE ROW EXCLUSIVE MODE` lock that blocks writes, using non-concurrent operations is appropriate:
- Writes are already blocked by our lock
- We need to stay in the transaction to prevent race condition
- Non-concurrent is faster and uses less disk space
- The lock ensures atomicity
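Under these constraints, the intended statement ordering can be sketched with a stub connection that records what it is asked to execute. This is an illustration of the atomic shape, not the actual `repair_index.rb` code; the class and SQL placeholders are hypothetical:

```ruby
# Stub connection that records SQL so the ordering can be inspected:
# lock, dedupe, and non-concurrent reindex all happen before COMMIT.
class RecordingConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  def execute(sql)
    @statements << sql
  end

  def transaction
    @statements << "BEGIN"
    yield
    @statements << "COMMIT"
  end
end

# Sketch of the proposed atomic method: one transaction, the lock held
# throughout, and REINDEX without CONCURRENTLY so it can run inside the block.
def deduplicate_and_reindex(conn, table:, index:)
  conn.transaction do
    conn.execute("LOCK TABLE #{table} IN SHARE ROW EXCLUSIVE MODE")
    conn.execute("DELETE FROM #{table} WHERE id IN (/* duplicate ids */)")
    conn.execute("REINDEX INDEX #{index}")
  end
end

conn = RecordingConnection.new
deduplicate_and_reindex(conn, table: "pm_packages",
                              index: "i_pm_packages_purl_type_and_name")
puts conn.statements
```

Because the `REINDEX` is issued before `COMMIT`, no writer can slip a duplicate in between deduplication and the index rebuild.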
## Affected Tables/Indexes

All unique indexes in `INDEXES_TO_REPAIR`:

- `pm_packages.i_pm_packages_purl_type_and_name`
- `pm_affected_packages.i_affected_packages_unique_for_upsert`
- `merge_request_diff_commit_users.index_merge_request_diff_commit_users_on_org_id_name_email`
- `topics.index_topics_on_organization_id_and_name`
- `ci_refs.index_ci_refs_on_project_id_and_ref_path`
- `ci_resource_groups.index_ci_resource_groups_on_project_id_and_key`
- `environments.index_environments_on_project_id_and_name`
- `sbom_components.idx_sbom_components_on_name_purl_type_component_type_and_org_id`
- `tags.index_tags_on_name`
- `container_repositories.index_container_repositories_on_project_id_and_name`
- `audit_events_group_external_streaming_destinations.unique_idx_group_destinations_on_name_category_group`
- `audit_events_instance_external_streaming_destinations.unique_idx_instance_destinations_on_name_category`
## Impact Assessment

Severity: High - prevents successful index repair on production systems

Frequency: Intermittent - depends on the timing of background jobs

Workaround:

1. Pause the background jobs that create duplicates
2. Run the repair task
3. Resume the background jobs
## Related Issues/MRs
- #372150 (closed) - Original index repair implementation
- #523146 (closed) - Package metadata duplicate handling
## Files to Change

- `lib/gitlab/database/repair_index.rb`