Remove duplicated services in data migration
The gitlab.com database holds some invalid data in the services table. A project is only allowed to have one service of each type but we found that there are duplicates. See &3366 (closed)
Write a data migration to remove each duplicated service record.
We currently don't define an
ORDER in the queries so the used record is not deterministic, but we can just assume the record returned by the default order is the correct one, and delete all others for the same project and service type. See the discussion in #282472 (comment 450051986).
- This is a long-running migration, so we need to efficiently batch the records to avoid running into timeouts.
- We haven't found the root cause for the duplicated records yet, so we might need to re-run this migration later.
We can verify the rollout of the migration by:
1️⃣Querying for the number of projects with duplicate records in
SELECT count(*) FROM (SELECT DISTINCT "services"."project_id" FROM "services" WHERE "services"."project_id" IS NOT NULL GROUP BY "services"."project_id", "services"."type" HAVING (count(*) > 1)) sub;
2️⃣Querying for the number of duplicate records in
WITH service_ids AS (SELECT max(id) AS max_id FROM services where project_id IS NOT NULL GROUP BY type, project_id) SELECT count(*) FROM services WHERE project_id IS NOT NULL AND id not IN (SELECT max_id FROM service_ids);
Note: This query is expensive and can take up to a minute on staging, so it needs
- Note: This query is expensive and can take up to a minute on staging, so it needs
3️⃣Querying for the number of project-associated records in
SELECT count(*) FROM "services" WHERE "services"."project_id" IS NOT NULL;
The expected numbers for the different environments are:
GPRD / gitlab.com
GSTG / staging.gitlab.com
¹ Queried at 2021-01-07 12:22:16+01:00, this number can increase with usage.