Skip to content
GitLab
Next
Projects Groups Snippets
  • /
  • Help
    • Help
    • Support
    • Community forum
    • Submit feedback
    • Contribute to GitLab
  • Sign in / Register
  • GitLab GitLab
  • Project information
    • Project information
    • Activity
    • Labels
    • Members
  • Repository
    • Repository
    • Files
    • Commits
    • Branches
    • Tags
    • Contributors
    • Graph
    • Compare
    • Locked Files
  • Issues 44,758
    • Issues 44,758
    • List
    • Boards
    • Service Desk
    • Milestones
    • Iterations
    • Requirements
  • Merge requests 1,330
    • Merge requests 1,330
  • CI/CD
    • CI/CD
    • Pipelines
    • Jobs
    • Schedules
    • Test Cases
  • Deployments
    • Deployments
    • Environments
    • Releases
  • Packages and registries
    • Packages and registries
    • Package Registry
    • Container Registry
    • Infrastructure Registry
  • Monitor
    • Monitor
    • Metrics
    • Incidents
  • Analytics
    • Analytics
    • Value stream
    • CI/CD
    • Code review
    • Insights
    • Issue
    • Repository
  • Snippets
    • Snippets
  • Activity
  • Graph
  • Create a new issue
  • Jobs
  • Commits
  • Issue Boards
Collapse sidebar
  • GitLab.orgGitLab.org
  • GitLabGitLab
  • Issues
  • #290008
Closed
Open
Issue created Dec 01, 2020 by Markus Koller@toupeira🌴Contributor

Remove duplicated services in data migration

Summary

The gitlab.com database holds some invalid data in the services table. A project is only allowed to have one service of each type but we found that there are duplicates. See &3366 (closed)

Improvements

Write a data migration to remove each duplicated service record.

We currently don't define an ORDER in the queries so the used record is not deterministic, but we can just assume the record returned by the default order is the correct one, and delete all others for the same project and service type. See the discussion in #282472 (comment 450051986).

Risks

  • This is a long-running migration, so we need to efficiently batch the records to avoid running into timeouts.
  • We haven't found the root cause for the duplicated records yet, so we might need to re-run this migration later.

Testing plan

We can verify the rollout of the migration by:

  • 1️⃣ Querying for the number of projects with duplicate records in services:

    SELECT count(*) FROM (SELECT DISTINCT "services"."project_id" FROM "services" WHERE "services"."project_id" IS NOT NULL GROUP BY "services"."project_id", "services"."type" HAVING (count(*) > 1)) sub;
  • 2️⃣ Querying for the number of duplicate records in services:

    WITH service_ids AS (SELECT max(id) AS max_id FROM services where project_id IS NOT NULL GROUP BY type, project_id) SELECT count(*) FROM services WHERE project_id IS NOT NULL AND id not IN (SELECT max_id FROM service_ids);
    • Note: This query is expensive and can take up to a minute on staging, so it needs SET statement_timeout=0;
  • 3️⃣ Querying for the number of project-associated records in services:

    SELECT count(*) FROM "services" WHERE "services"."project_id" IS NOT NULL;

The expected numbers for the different environments are:

GPRD / gitlab.com

Query Before After
1️⃣ 508 0
2️⃣ 715 0
3️⃣ 756070¹ 755355 (-715)

GSTG / staging.gitlab.com

Query Before After
1️⃣ 959831 0
2️⃣ 960056 0
3️⃣ 7886312¹ 6926256 (-960056)

¹ Queried at 2021-01-07 12:22:16+01:00, this number can increase with usage.

Edited Jan 07, 2021 by Markus Koller
Assignee
Assign to
Time tracking