Skip to content

Remove duplicated services in data migration

Summary

The gitlab.com database holds some invalid data in the services table. A project is only allowed to have one service of each type but we found that there are duplicates. See &3366 (closed)

Improvements

Write a data migration to remove each duplicated service record.

We currently don't define an ORDER in the queries so the used record is not deterministic, but we can just assume the record returned by the default order is the correct one, and delete all others for the same project and service type. See the discussion in #282472 (comment 450051986).

Risks

  • This is a long-running migration, so we need to efficiently batch the records to avoid running into timeouts.
  • We haven't found the root cause for the duplicated records yet, so we might need to re-run this migration later.

Testing plan

We can verify the rollout of the migration by:

  • 1️⃣ Querying for the number of projects with duplicate records in services:

    SELECT count(*) FROM (SELECT DISTINCT "services"."project_id" FROM "services" WHERE "services"."project_id" IS NOT NULL GROUP BY "services"."project_id", "services"."type" HAVING (count(*) > 1)) sub;
  • 2️⃣ Querying for the number of duplicate records in services:

    WITH service_ids AS (SELECT max(id) AS max_id FROM services where project_id IS NOT NULL GROUP BY type, project_id) SELECT count(*) FROM services WHERE project_id IS NOT NULL AND id not IN (SELECT max_id FROM service_ids);
    • Note: This query is expensive and can take up to a minute on staging, so it needs SET statement_timeout=0;
  • 3️⃣ Querying for the number of project-associated records in services:

    SELECT count(*) FROM "services" WHERE "services"."project_id" IS NOT NULL;

The expected numbers for the different environments are:

GPRD / gitlab.com

Query Before After
1️⃣ 508 0
2️⃣ 715 0
3️⃣ 756070¹ 755355 (-715)

GSTG / staging.gitlab.com

Query Before After
1️⃣ 959831 0
2️⃣ 960056 0
3️⃣ 7886312¹ 6926256 (-960056)

¹ Queried at 2021-01-07 12:22:16+01:00, this number can increase with usage.

Edited by Markus Koller