Geo: Ensure UNIQUE indexes are used for resources in registry tables on the Geo Tracking Database

Summary

We have identified a race condition in the creation of some registry entries in the Geo Tracking Database.

Additional context can be seen in this customer support issue: https://gitlab.com/gitlab-com/geo-customers/-/issues/246 (Internal Link)

What is the current bug behavior?

A race condition can occur, and has been confirmed in at least two registry tables:

job_artifact_registry, which includes the following indexes:

CREATE INDEX index_job_artifact_registry_on_artifact_id
    ON job_artifact_registry (artifact_id);

CREATE INDEX index_job_artifact_registry_on_retry_at
    ON job_artifact_registry (retry_at);

CREATE INDEX job_artifact_registry_failed_verification
    ON job_artifact_registry (verification_retry_at)
    WHERE ((state = 2) AND (verification_state = 3));

CREATE INDEX job_artifact_registry_needs_verification
    ON job_artifact_registry (verification_state)
    WHERE ((state = 2) AND (verification_state = ANY (ARRAY [0, 3])));

CREATE INDEX job_artifact_registry_pending_verification
    ON job_artifact_registry (verified_at)
    WHERE ((state = 2) AND (verification_state = 0));

allows multiple entries that point to the same artifact_id.

file_registry, which includes the following indexes:

CREATE UNIQUE INDEX file_registry_pkey
    ON public.file_registry (id);

CREATE INDEX file_registry_failed_verification
    ON public.file_registry (verification_retry_at)
    WHERE ((state = 2) AND (verification_state = 3));

CREATE INDEX file_registry_needs_verification
    ON public.file_registry (verification_state)
    WHERE ((state = 2) AND (verification_state = ANY (ARRAY [0, 3])));

CREATE INDEX file_registry_pending_verification
    ON public.file_registry (verified_at)
    WHERE ((state = 2) AND (verification_state = 0));

CREATE INDEX index_file_registry_on_retry_at
    ON public.file_registry (retry_at);

allows multiple entries that point to the same file_id.

What is the expected correct behavior?

Each resource should be referenced by at most one registry entry, enforced by a unique index on the resource's foreign key column.
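The invariant can be stated concretely: grouping registry rows by their resource foreign key must never yield a group larger than one. A minimal plain-Ruby sketch of that check (the row data and the `duplicate_resource_ids` helper are illustrative, not part of the codebase):

```ruby
# Return the resource IDs that are referenced by more than one
# registry row. Each row is a hash with :id (registry primary key)
# and the resource foreign key (e.g. :artifact_id).
def duplicate_resource_ids(rows, fk)
  rows.group_by { |row| row[fk] }
      .select { |_, group| group.size > 1 }
      .keys
end

rows = [
  { id: 1, artifact_id: 10 },
  { id: 2, artifact_id: 11 },
  { id: 3, artifact_id: 10 }, # duplicate entry for artifact 10
]

duplicate_resource_ids(rows, :artifact_id) # => [10]
```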

Possible fixes

We need to create unique indexes on the corresponding resource reference column for each table that is missing one. Enforcing uniqueness at the database level prevents all downstream effects of this bad data.

Because a unique index cannot be created on a table that contains duplicates, a data migration step must first remove the existing duplicates before the new indexes are created. Follow https://docs.gitlab.com/development/database/deduplicate_database_records/#deduplicate-strategy-for-large-tables. Note that this requires 3 releases and the last one must come after a required stop.
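The core of the deduplication step is choosing which row survives per resource. A common choice is to keep the row with the lowest registry id and delete the rest; a real migration would do this in batches against the database, but the selection logic can be sketched in plain Ruby (the `ids_to_delete` helper and the row data are illustrative):

```ruby
# Given registry rows and a resource foreign key, keep the row with
# the lowest registry id per resource and return the ids of all the
# other (duplicate) rows, which are the ones to delete.
def ids_to_delete(rows, fk)
  rows.group_by { |row| row[fk] }
      .flat_map { |_, group| group.sort_by { |r| r[:id] }.drop(1) }
      .map { |row| row[:id] }
end

rows = [
  { id: 1, artifact_id: 10 },
  { id: 2, artifact_id: 11 },
  { id: 3, artifact_id: 10 },
  { id: 5, artifact_id: 10 },
]

# Rows 3 and 5 duplicate artifact 10; row 1 survives.
ids_to_delete(rows, :artifact_id) # => [3, 5]
```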

We need to add exactly three unique indexes:

  • package_file_id on package_file_registry
  • artifact_id on job_artifact_registry
  • file_id on file_registry
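Once duplicates are removed, each unique index could be added in a post-deployment migration along these lines. This is a sketch only: the migration class name and index name are illustrative, and the exact helper usage should follow the current GitLab migration style guide.

```ruby
# Sketch: add a unique index on job_artifact_registry.artifact_id.
# Class name and INDEX_NAME are hypothetical.
class AddUniqueIndexOnJobArtifactRegistryArtifactId < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  INDEX_NAME = 'index_job_artifact_registry_on_artifact_id_unique'

  def up
    add_concurrent_index :job_artifact_registry, :artifact_id,
                         unique: true, name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :job_artifact_registry, INDEX_NAME
  end
end
```

Equivalent migrations would cover package_file_registry.package_file_id and file_registry.file_id.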

Workarounds

Copied from #479131 (closed). To delete duplicate registry rows:

  1. Open a Rails console on the secondary site

  2. Get number of model record IDs that have duplicates

    artifact_ids = Geo::JobArtifactRegistry.group(:artifact_id).having('COUNT(*) > 1').pluck(:artifact_id); artifact_ids.size
    upload_ids = Geo::UploadRegistry.group(:file_id).having('COUNT(*) > 1').pluck(:file_id); upload_ids.size
    package_file_ids = Geo::PackageFileRegistry.group(:package_file_id).having('COUNT(*) > 1').pluck(:package_file_id); package_file_ids.size
  3. Output the IDs in case we lose connection later

    puts 'BEGIN Artifact IDs', artifact_ids, 'END Artifact IDs'
    puts 'BEGIN Upload IDs', upload_ids, 'END Upload IDs'
    puts 'BEGIN Package File IDs', package_file_ids, 'END Package File IDs'
  4. Save the terminal output

  5. Delete all duplicates

    Geo::JobArtifactRegistry.where(artifact_id: artifact_ids).delete_all
    Geo::UploadRegistry.where(file_id: upload_ids).delete_all
    Geo::PackageFileRegistry.where(package_file_id: package_file_ids).delete_all
  6. Wait for background jobs to recreate the registry rows and resync the affected data

Implementation plan

MR 1: Resolve known application race conditions that create duplicate rows

  1. At some point before calling registry.start!, persist the registry if it isn't already persisted. safe_find_or_create_by may be a good fit here
  2. Replace registry.start! with registry.reload.start!
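The race in MR 1 is the classic find-then-create gap: two sync jobs both see no registry row and both insert one. The safe find-or-create pattern closes it by rescuing the uniqueness violation and re-reading the winner's row, which only works once the unique index exists. A self-contained sketch of the pattern, where `Store` and `DuplicateKeyError` are in-memory stand-ins for the registry table and ActiveRecord::RecordNotUnique:

```ruby
# Stand-in for the database's unique-index violation.
class DuplicateKeyError < StandardError; end

# Minimal in-memory "table" enforcing uniqueness on a key,
# as a unique index would.
class Store
  def initialize
    @rows = {}
  end

  def find_by(key)
    @rows[key]
  end

  def create!(key, value)
    raise DuplicateKeyError if @rows.key?(key)
    @rows[key] = value
  end
end

# Sketch of the safe find-or-create pattern: if the create loses
# the race, rescue the uniqueness violation and return the row the
# winner inserted.
def safe_find_or_create_by(store, key, value)
  store.find_by(key) || store.create!(key, value)
rescue DuplicateKeyError
  store.find_by(key)
end

store  = Store.new
first  = safe_find_or_create_by(store, 10, 'registry-A')
second = safe_find_or_create_by(store, 10, 'registry-B') # loses; gets A's row
# first == second == 'registry-A'
```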

MR 2: Document the workaround to manually delete duplicates to make it more easily findable

(This can be done in parallel with MR 1.)

  1. Add this workaround to Geo Troubleshooting Common Errors doc

Follow up issue: #519745.

Edited by Michael Kozono