Geo: Ensure UNIQUE indexes are used for resources in registry tables on the Geo Tracking Database

Summary

We have identified a race condition in the creation of some registry entries in the Geo Tracking Database.

Additional context can be seen in this customer support issue: https://gitlab.com/gitlab-com/geo-customers/-/issues/246 (Internal Link)

What is the current bug behavior?

A race condition can occur, and has been confirmed in at least two registry tables:

job_artifact_registry, which includes the following indexes:

CREATE INDEX index_job_artifact_registry_on_artifact_id
    ON job_artifact_registry (artifact_id);

CREATE INDEX index_job_artifact_registry_on_retry_at
    ON job_artifact_registry (retry_at);

CREATE INDEX job_artifact_registry_failed_verification
    ON job_artifact_registry (verification_retry_at)
    WHERE ((state = 2) AND (verification_state = 3));

CREATE INDEX job_artifact_registry_needs_verification
    ON job_artifact_registry (verification_state)
    WHERE ((state = 2) AND (verification_state = ANY (ARRAY [0, 3])));

CREATE INDEX job_artifact_registry_pending_verification
    ON job_artifact_registry (verified_at)
    WHERE ((state = 2) AND (verification_state = 0));

allows multiple entries that point to the same artifact_id.

file_registry, which includes the following indexes:

CREATE UNIQUE INDEX file_registry_pkey
    ON public.file_registry (id);

CREATE INDEX file_registry_failed_verification
    ON public.file_registry (verification_retry_at)
    WHERE ((state = 2) AND (verification_state = 3));

CREATE INDEX file_registry_needs_verification
    ON public.file_registry (verification_state)
    WHERE ((state = 2) AND (verification_state = ANY (ARRAY [0, 3])));

CREATE INDEX file_registry_pending_verification
    ON public.file_registry (verified_at)
    WHERE ((state = 2) AND (verification_state = 0));

CREATE INDEX index_file_registry_on_retry_at
    ON public.file_registry (retry_at);

allows multiple entries that point to the same file_id.

What is the expected correct behavior?

Each resource should be referenced by at most one registry entry, enforced by a unique index on the resource's foreign key column.
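The invariant can be stated concretely: grouping registry rows by their resource foreign key must never yield a group larger than one. A minimal plain-Ruby sketch of that check (the row data and the `duplicate_resource_ids` helper are illustrative, not part of the codebase):

```ruby
# Return the resource IDs that are referenced by more than one
# registry row. Each row is a hash with :id (registry primary key)
# and the resource foreign key (e.g. :artifact_id).
def duplicate_resource_ids(rows, fk)
  rows.group_by { |row| row[fk] }
      .select { |_, group| group.size > 1 }
      .keys
end

rows = [
  { id: 1, artifact_id: 10 },
  { id: 2, artifact_id: 11 },
  { id: 3, artifact_id: 10 }, # duplicate entry for artifact 10
]

duplicate_resource_ids(rows, :artifact_id) # => [10]
```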

Possible fixes

We need to create unique indexes on the corresponding resource reference column for each table that is missing one. Enforcing uniqueness at the database level prevents all downstream effects of this bad data.

Because a unique index cannot be created on a table that contains duplicates, a data migration step must first remove the existing duplicates before the new indexes are created. Follow https://docs.gitlab.com/development/database/deduplicate_database_records/#deduplicate-strategy-for-large-tables. Note that this requires 3 releases and the last one must come after a required stop.
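The core of the deduplication step is choosing which row survives per resource. A common choice is to keep the row with the lowest registry id and delete the rest; a real migration would do this in batches against the database, but the selection logic can be sketched in plain Ruby (the `ids_to_delete` helper and the row data are illustrative):

```ruby
# Given registry rows and a resource foreign key, keep the row with
# the lowest registry id per resource and return the ids of all the
# other (duplicate) rows, which are the ones to delete.
def ids_to_delete(rows, fk)
  rows.group_by { |row| row[fk] }
      .flat_map { |_, group| group.sort_by { |r| r[:id] }.drop(1) }
      .map { |row| row[:id] }
end

rows = [
  { id: 1, artifact_id: 10 },
  { id: 2, artifact_id: 11 },
  { id: 3, artifact_id: 10 },
  { id: 5, artifact_id: 10 },
]

# Rows 3 and 5 duplicate artifact 10; row 1 survives.
ids_to_delete(rows, :artifact_id) # => [3, 5]
```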

We need to add exactly three unique indexes:

  • package_file_id on package_file_registry
  • artifact_id on job_artifact_registry
  • file_id on file_registry
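Once duplicates are removed, each unique index could be added in a post-deployment migration along these lines. This is a sketch only: the migration class name and index name are illustrative, and the exact helper usage should follow the current GitLab migration style guide.

```ruby
# Sketch: add a unique index on job_artifact_registry.artifact_id.
# Class name and INDEX_NAME are hypothetical.
class AddUniqueIndexOnJobArtifactRegistryArtifactId < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  INDEX_NAME = 'index_job_artifact_registry_on_artifact_id_unique'

  def up
    add_concurrent_index :job_artifact_registry, :artifact_id,
                         unique: true, name: INDEX_NAME
  end

  def down
    remove_concurrent_index_by_name :job_artifact_registry, INDEX_NAME
  end
end
```

Equivalent migrations would cover package_file_registry.package_file_id and file_registry.file_id.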

Workarounds

Copied from #479131 (closed). To delete duplicate registry rows:

  1. Open a Rails console on the secondary site

  2. Get number of model record IDs that have duplicates

    artifact_ids = Geo::JobArtifactRegistry.group(:artifact_id).having('COUNT(*) > 1').pluck(:artifact_id); artifact_ids.size
    upload_ids = Geo::UploadRegistry.group(:file_id).having('COUNT(*) > 1').pluck(:file_id); upload_ids.size
    package_file_ids = Geo::PackageFileRegistry.group(:package_file_id).having('COUNT(*) > 1').pluck(:package_file_id); package_file_ids.size
  3. Output the IDs in case we lose connection later

    puts 'BEGIN Artifact IDs', artifact_ids, 'END Artifact IDs'
    puts 'BEGIN Upload IDs', upload_ids, 'END Upload IDs'
    puts 'BEGIN Package File IDs', package_file_ids, 'END Package File IDs'
  4. Save the terminal output

  5. Delete all duplicates

    Geo::JobArtifactRegistry.where(artifact_id: artifact_ids).delete_all
    Geo::UploadRegistry.where(file_id: upload_ids).delete_all
    Geo::PackageFileRegistry.where(package_file_id: package_file_ids).delete_all
  6. Wait for background jobs to recreate the registry rows and resync the affected data

Implementation plan

MR 1: Resolve known application race conditions that create duplicate rows

  1. At some point before calling registry.start!, persist the registry if it isn't already persisted. safe_find_or_create_by may be a good fit here
  2. Replace registry.start! with registry.reload.start!
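The race in MR 1 is the classic find-then-create gap: two sync jobs both see no registry row and both insert one. The safe find-or-create pattern closes it by rescuing the uniqueness violation and re-reading the winner's row, which only works once the unique index exists. A self-contained sketch of the pattern, where `Store` and `DuplicateKeyError` are in-memory stand-ins for the registry table and ActiveRecord::RecordNotUnique:

```ruby
# Stand-in for the database's unique-index violation.
class DuplicateKeyError < StandardError; end

# Minimal in-memory "table" enforcing uniqueness on a key,
# as a unique index would.
class Store
  def initialize
    @rows = {}
  end

  def find_by(key)
    @rows[key]
  end

  def create!(key, value)
    raise DuplicateKeyError if @rows.key?(key)
    @rows[key] = value
  end
end

# Sketch of the safe find-or-create pattern: if the create loses
# the race, rescue the uniqueness violation and return the row the
# winner inserted.
def safe_find_or_create_by(store, key, value)
  store.find_by(key) || store.create!(key, value)
rescue DuplicateKeyError
  store.find_by(key)
end

store  = Store.new
first  = safe_find_or_create_by(store, 10, 'registry-A')
second = safe_find_or_create_by(store, 10, 'registry-B') # loses; gets A's row
# first == second == 'registry-A'
```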

MR 2: Document the workaround to manually delete duplicates to make it more easily findable

(This can be done in parallel with MR 1.)

  1. Add this workaround to Geo Troubleshooting Common Errors doc

Follow up issue: #519745.

Edited by Michael Kozono