Fix race condition in gitlab:db:repair_index task

What does this MR do and why?

Fixes a race condition in the gitlab:db:repair_index rake task that causes intermittent failures when repairing unique indexes.

Problem: The task deletes duplicate records and then attempts to reindex, but new duplicates can be inserted between these two operations, causing the reindex to fail with unique constraint violations.

Solution: Wrap both deduplication and reindexing in a single atomic transaction using non-concurrent index operations to maintain the lock throughout the entire process.

Closes #583776

Zendesk ticket: https://gitlab.zendesk.com/agent/tickets/678309 (Internal only)

How to set up and validate locally

1. Create test duplicates

# In Rails console
pm_package = PackageMetadata::Package.create!(purl_type: 7, name: 'test-duplicate-1')
PackageMetadata::Package.create!(purl_type: 7, name: 'test-duplicate-1')  # Duplicate

# Verify duplicates exist
ActiveRecord::Base.connection.execute(<<~SQL)
  SELECT purl_type, name, COUNT(*) as count, ARRAY_AGG(id ORDER BY id) as ids
  FROM pm_packages
  WHERE purl_type = 7 AND name = 'test-duplicate-1'
  GROUP BY purl_type, name
  HAVING COUNT(*) > 1;
SQL

2. Test the fix

# Run the repair task
bundle exec rake gitlab:db:repair_index

# Should output:
# Processing index 'i_pm_packages_purl_type_and_name' on table 'pm_packages'...
# Index is unique. Deduplicating and reindexing atomically...
# Acquiring lock on 'pm_packages' to prevent concurrent modifications...
# Lock acquired. Proceeding with deduplication...
# Found X duplicates in 'pm_packages' for columns: purl_type,name
# Deleting duplicate records from pm_packages...
# Deleted X duplicate records from pm_packages
# Reindexing index (non-concurrent to maintain lock)...
# Index reindexed successfully (non-concurrent).
# Deduplication and reindexing completed atomically.

3. Verify no duplicates remain

SELECT purl_type, name, COUNT(*)
FROM pm_packages
GROUP BY purl_type, name
HAVING COUNT(*) > 1;
-- Should return 0 rows

4. Verify index is valid

SELECT indexname, indexdef
FROM pg_indexes
WHERE indexname = 'i_pm_packages_purl_type_and_name';
-- Should show the index exists

Technical Details

Changes Made

  1. Added non-concurrent SQL templates:

    • REINDEX_SQL_NON_CONCURRENT - for reindexing inside transactions
    • CREATE_INDEX_SQL_NON_CONCURRENT - for index creation inside transactions
  2. Created deduplicate_and_reindex method:

    • Wraps deduplication and reindexing in a single transaction
    • Maintains SHARE ROW EXCLUSIVE MODE lock throughout
    • Prevents race condition window
  3. Added helper methods:

    • reindex_index_non_concurrent - reindex without CONCURRENTLY
    • create_index_non_concurrent - create index without CONCURRENTLY
  4. Updated main run method:

    • Calls deduplicate_and_reindex for unique indexes
    • Ensures atomic operation

Why Non-Concurrent?

REINDEX INDEX CONCURRENTLY cannot run inside a transaction block (PostgreSQL limitation). Since we already hold a SHARE ROW EXCLUSIVE MODE lock that blocks writes, using non-concurrent operations is appropriate:

  • Writes are already blocked by our lock
  • We need to stay in the transaction to prevent race condition
  • Non-concurrent is faster and uses less disk space
  • The lock ensures atomicity

Before vs After

Before (Race Condition):

deduplicate_data(...)  # Transaction ends here, lock released
# ← Window for new duplicates to be inserted
reindex_index(...)     # Fails if duplicates were inserted

After (Atomic):

connection.transaction do
  LOCK TABLE
  deduplicate_data(...)
  reindex_index_non_concurrent(...)  # Lock still held
end  # Lock released only here

Performance Impact

Trade-offs:

Pros:

  • Eliminates race condition completely
  • Guarantees atomic operation
  • Faster reindex (non-concurrent)
  • Uses less disk space

⚠️ Cons:

  • Blocks writes during reindex (but already blocked during deduplication)
  • Should be run during maintenance window for large tables

Recommendation: For tables with < 10,000 duplicates, the impact is minimal. For larger tables, schedule during maintenance window.

Affected Tables

All unique indexes in INDEXES_TO_REPAIR:

  • pm_packages
  • pm_affected_packages
  • merge_request_diff_commit_users
  • topics
  • ci_refs
  • ci_resource_groups
  • environments
  • sbom_components
  • tags
  • container_repositories
  • audit_events_group_external_streaming_destinations
  • audit_events_instance_external_streaming_destinations

Files Changed

  • lib/gitlab/database/repair_index.rb
  • changelogs/unreleased/fix-repair-index-race-condition.yml
Edited by David Wainaina

Merge request reports

Loading