Fix container_repositories index repair to handle one-to-one relationship
What does this MR do and why?
The container_repository_states table has a primary key constraint on
container_repository_id, enforcing a one-to-one relationship with
container_repositories. When duplicate container_repositories entries
exist, previously the repair task attempted to update all
container_repository_states references to point to the "good" ID,
which violated the primary key constraint when a state entry already
existed for that ID.
For example, suppose the container_repositories table:
| id | project_id | name |
|---|---|---|
| 1 | 1000 | test |
| 2 | 1000 | test |
| 3 | 1000 | test |
With this container_repository_states table:
| container_repository_id | verified_at |
|---|---|
| 1 | 2025-12-01 |
| 2 | 2025-12-02 |
| 3 | 2025-12-03 |
The old approach tried to fix the duplication in container_repository_states by remapping the container_repository_id:
UPDATE container_repository_states SET container_repository_id = 3 WHERE container_repository_id IN (1)
However, this failed with:
PG::UniqueViolation: duplicate key value violates unique constraint "container_repository_states_pkey"
The fix adds a new delete_orphaned option to reference configurations.
When set to true, the repair task deletes all references pointing to the
duplicate (bad) IDs before deleting the duplicate records. This approach:
- Deletes the
container_repository_statesrows for duplicate IDs. - Deletes the duplicate container_repository rows.
- Preserves the state entry for the good ID.
This is simpler and more appropriate than using deduplication_column for
one-to-one relationships where we don't need to preserve the orphaned
references.
References
ZD: https://gitlab.zendesk.com/agent/tickets/676052
How to set up and validate locally
- Run
gdk psql -d gitlabhq_test - Run these SQL queries:
-- 1. Create a test organization (required for namespace and project)
INSERT INTO organizations (
id,
name,
path,
created_at,
updated_at
)
VALUES (
999999,
'test-organization',
'test-organization',
NOW(),
NOW()
)
ON CONFLICT (id) DO NOTHING;
-- 2. Create a test namespace (required for project_namespace_id foreign key)
INSERT INTO namespaces (
id,
name,
path,
type,
created_at,
updated_at,
organization_id
)
VALUES (
999999,
'test-namespace',
'test-namespace',
'Project',
NOW(),
NOW(),
999999 -- Use the test organization we just created
)
ON CONFLICT (id) DO NOTHING;
-- 3. Create a test project (required for foreign key constraint)
INSERT INTO projects (
id,
name,
path,
namespace_id,
project_namespace_id,
organization_id,
created_at,
updated_at
)
VALUES (
999999,
'test-project',
'test-project',
999999, -- Use the test namespace we just created
999999, -- Use the test namespace we just created
999999, -- Use the test organization we just created
NOW(),
NOW()
)
ON CONFLICT (id) DO NOTHING;
-- 4. Drop the unique index to allow duplicates
DROP INDEX IF EXISTS index_container_repositories_on_project_id_and_name;
-- 5. Create test data: duplicate container_repositories entries
-- Simulate the scenario where we have duplicates with project_id=999999, name='test-repo'
INSERT INTO container_repositories (
project_id,
name,
created_at,
updated_at
)
VALUES
(999999, 'test-repo', NOW(), NOW()), -- ID: will be the "good" ID
(999999, 'test-repo', NOW(), NOW()); -- ID: will be the "bad" ID (duplicate)
-- 6. Create container_repository_states entries
-- We'll use the actual IDs from the previous insert
-- First, get the IDs dynamically
WITH inserted_ids AS (
SELECT id FROM container_repositories
WHERE project_id = 999999 AND name = 'test-repo'
ORDER BY id
)
INSERT INTO container_repository_states (container_repository_id, verified_at)
SELECT id, NOW() FROM inserted_ids;
- With this data, you should see:
gitlabhq_test=# \x
Expanded display is on.
gitlabhq_test=# select * from container_repository_states;
-[ RECORD 1 ]------------+------------------------------
verification_started_at |
verification_retry_at |
verified_at | 2025-12-20 05:47:55.426783+00
container_repository_id | 1
verification_state | 0
verification_retry_count | 0
verification_checksum |
verification_failure |
project_id | 999999
-[ RECORD 2 ]------------+------------------------------
verification_started_at |
verification_retry_at |
verified_at | 2025-12-20 05:47:55.426783+00
container_repository_id | 2
verification_state | 0
verification_retry_count | 0
verification_checksum |
verification_failure |
project_id | 999999
gitlabhq_test=# select * from container_repositories;
-[ RECORD 1 ]--------------------+---------------------------
id | 1
project_id | 999999
name | test-repo
created_at | 2025-12-20 05:47:55.424048
updated_at | 2025-12-20 05:47:55.424048
status |
expiration_policy_started_at |
expiration_policy_cleanup_status | 0
expiration_policy_completed_at |
last_cleanup_deleted_tags_count |
delete_started_at |
status_updated_at |
failed_deletion_count | 0
next_delete_attempt_at |
-[ RECORD 2 ]--------------------+---------------------------
id | 2
project_id | 999999
name | test-repo
created_at | 2025-12-20 05:47:55.424048
updated_at | 2025-12-20 05:47:55.424048
status |
expiration_policy_started_at |
expiration_policy_cleanup_status | 0
expiration_policy_completed_at |
last_cleanup_deleted_tags_count |
delete_started_at |
status_updated_at |
failed_deletion_count | 0
next_delete_attempt_at |
- Then run
RAILS_ENV=test bundle exec rake gitlab:db:repair_index. This should succeed and show the deduplication:
I, [2025-12-19T21:48:54.150239 #35056] INFO -- : Processing index 'index_container_repositories_on_project_id_and_name' on table 'container_repositories'...
I, [2025-12-19T21:48:54.150247 #35056] INFO -- : Index is unique. Checking for duplicate data...
I, [2025-12-19T21:48:54.150254 #35056] INFO -- : Checking for duplicates in 'container_repositories' for columns: project_id,name...
I, [2025-12-19T21:48:54.150327 #35056] INFO -- : SQL: SELECT ARRAY_AGG(id ORDER BY id ASC) as ids
FROM "container_repositories"
WHERE "project_id" IS NOT NULL AND "name" IS NOT NULL
GROUP BY "project_id", "name"
HAVING COUNT(*) > 1
W, [2025-12-19T21:48:54.153129 #35056] WARN -- : Found 1 duplicates in 'container_repositories' for columns: project_id,name
I, [2025-12-19T21:48:54.153308 #35056] INFO -- : Deleting orphaned references in 'container_repository_states' for column 'container_repository_id'...
I, [2025-12-19T21:48:54.153356 #35056] INFO -- : SQL: DELETE FROM "container_repository_states" WHERE "container_repository_id" IN (2)
I, [2025-12-19T21:48:54.154665 #35056] INFO -- : Deleted 1 orphaned references in 'container_repository_states'
I, [2025-12-19T21:48:54.154676 #35056] INFO -- : Deleting duplicate records from container_repositories...
I, [2025-12-19T21:48:54.154706 #35056] INFO -- : SQL: DELETE FROM "container_repositories" WHERE id IN (2)
I, [2025-12-19T21:48:54.155332 #35056] INFO -- : Deleted 1 duplicate records from container_repositories
The table should look something like:
gitlabhq_test=# select * from container_repository_states;
-[ RECORD 1 ]------------+------------------------------
verification_started_at |
verification_retry_at |
verified_at | 2025-12-20 05:47:55.426783+00
container_repository_id | 1
verification_state | 0
verification_retry_count | 0
verification_checksum |
verification_failure |
project_id | 999999
gitlabhq_test=# select * from container_repositories;
-[ RECORD 1 ]--------------------+---------------------------
id | 1
project_id | 999999
name | test-repo
created_at | 2025-12-20 05:47:55.424048
updated_at | 2025-12-20 05:47:55.424048
status |
expiration_policy_started_at |
expiration_policy_cleanup_status | 0
expiration_policy_completed_at |
last_cleanup_deleted_tags_count |
delete_started_at |
status_updated_at |
failed_deletion_count | 0
next_delete_attempt_at |
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.