Skip to content

Background migration: merge topics with same name

What does this MR do and why?

This MR is the third step of #351871 (closed) and merges all topics with the same case insensitive name using a background database migration:

  • The topic with the largest number of assigned projects is kept.
  • The other topics with the same case insensitive name are deleted and the project assignments are transferred to the kept topic.
  • If the kept topic has no avatar or description, the avatar and description is taken from the deleted topic with the second most assigned projects, then from the third most, ...

This approach was discussed in #351871 (comment 830292330).

Example:

Before After
Topics:
- javascript (2710 projects)
- JavaScript (967 projects, avatar: 'js.png')
- Javascript (467 projects, description: 'foo...')
- JAVASCRIPT (11 projects)
- javaScript (6 projects)
Topics:
- javascript (4161 projects, avatar: 'js.png', description: 'foo...')
- other topics with same case insensitive name are removed

🛠 with at Siemens

/cc @bufferoverflow

Database Migration

Irreversible migration (data-deletion)

The background data migration introduced by this MR is designed to delete some topic records if they are duplicates. If an avatar file exists, it will also be deleted (application code). The associated project_topic records are transferred to the kept topic (with the same name) or deleted if an identical project_topic record already exists for the kept topic.

There is no way to bring back the deleted records as they are duplicates anyway.

User experience impact (data-deletion)

During this migration, duplicated topics are deleted. If the migration fails, topics with the same name may still exist. This would be no difference to the current situation.

Update/deletion query plans (data-deletion)

see !81503 (comment 892088813)

Query Plans (outdated)

ScheduleMergeTopicsWithSameName

Group topics by case insensitive name: Projects::Topic.select('LOWER(name) as name').group('LOWER(name)').having('COUNT(*) > 1')

SQL query
SELECT
    LOWER(name) AS name
FROM
    "topics"
GROUP BY
    LOWER(name)
HAVING (COUNT(*) > 1);

Explain Analyze

Gitlab::BackgroundMigration::MergeTopicsWithSameName

Check if topics exist: topics.offset(1).any?

SQL query
SELECT
    1 AS one
FROM
    "topics"
WHERE (LOWER(name) = 'topic1')
LIMIT 1 OFFSET 1;

Explain Analyze

Get 'topic to keep': topics.first

SQL query
SELECT
    "topics".*
FROM
    "topics"
WHERE (LOWER(name) = 'topic1')
ORDER BY
    "topics"."total_projects_count" DESC,
    "topics"."non_private_projects_count" DESC,
    "topics"."id" ASC
LIMIT 1;

Explain Analyze

Get 'topics to remove': topics.offset(1)

SQL query
SELECT
    "topics".*
FROM
    "topics"
WHERE (LOWER(name) = 'topic1')
ORDER BY
    "topics"."total_projects_count" DESC,
    "topics"."non_private_projects_count" DESC,
    "topics"."id" ASC
OFFSET 1;

Explain Analyze

Change project topic assignments from 'topic to remove' to 'topic to keep': ProjectTopic.[...].update_all(...)

SQL query
UPDATE
    "project_topics"
SET
    "topic_id" = 97
WHERE
    "project_topics"."topic_id" = 98
    AND "project_topics"."project_id" NOT IN (
        SELECT
            "project_topics"."project_id"
        FROM
            "project_topics"
        WHERE
            "project_topics"."topic_id" = 97);

Explain Analyze

Remove project topic assignments of 'topic to remove' that already exist for 'topic to keep': ProjectTopic.where(topic_id: topic.id).delete_all

SQL query
DELETE FROM "project_topics"
WHERE "project_topics"."topic_id" = 98;

Explain Analyze

Remove 'topics to delete': topics_to_remove.delete_all

SQL query
DELETE FROM "topics"
WHERE "topics"."id" IN (
        SELECT
            "topics"."id"
        FROM
            "topics"
        WHERE (LOWER(name) = 'topic1')
    ORDER BY
        "topics"."total_projects_count" DESC,
        "topics"."non_private_projects_count" DESC,
        "topics"."id" ASC
    OFFSET 1);

Explain Analyze

Get total projects count: total_projects_count(...)

SQL query
SELECT
    COUNT(*)
FROM
    "project_topics"
WHERE
    "project_topics"."topic_id" = 97;

Explain Analyze

Get non-internal projects count: non_private_projects_count(...)

SQL query
SELECT
    COUNT(*)
FROM
    "project_topics"
INNER JOIN projects ON project_topics.project_id = projects.id
WHERE
    "project_topics"."topic_id" = 97
    AND (projects.visibility_level > 0);

Explain Analyze

Update 'topic to keep': topic_to_keep.update(...)

SQL query
UPDATE
    "topics"
SET
    "updated_at" = '2022-03-08 09:14:45.882098',
    "total_projects_count" = 3
WHERE
    "topics"."id" = 97;

Explain Analyze

How to set up and validate locally

  1. Add topics with the same case insensitive name (you need to bypass the model validation).
  2. Assign these topics to different projects (project settings).
  3. Run the background migration.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Jonas Wälter

Merge request reports