Remove duplicate services in background migration

Markus Koller requested to merge 290008-remove-duplicate-services into master

What does this MR do?

This removes duplicated records of the same service type on the same project, using the current (non-deterministic) default order to decide which record to keep.

#290008 (closed)

Database Migrations

Post-deployment migration: db/post_migrate/20201207165956_remove_duplicate_services.rb

This post-deployment migration uses keyset pagination to bulk-queue background jobs, passing a list of project IDs where duplicate services exist.
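As a rough illustration of the approach described above, here is a minimal in-memory sketch of keyset pagination over project IDs with duplicate services. It is not the code from this MR: the `Service` struct, both method names, the sample data, and the batch size of 2 are all hypothetical, and a real migration would run the equivalent query against the database and queue a background job for each batch.

```ruby
# Hypothetical in-memory stand-in for rows of the services table.
Service = Struct.new(:id, :project_id, :type)

# Returns the next page of project IDs that have duplicate services, ordered by
# project ID and strictly greater than `after_id` (the keyset cursor).
def next_project_ids_with_duplicates(services, after_id:, limit:)
  services
    .group_by { |s| [s.project_id, s.type] }
    .select { |_key, rows| rows.size > 1 }
    .keys
    .map(&:first)
    .uniq
    .select { |project_id| project_id > after_id }
    .sort
    .first(limit)
end

# Walks the pages and returns one array of project IDs per job to queue.
def batches_to_queue(services, batch_size:)
  batches = []
  cursor = 0
  loop do
    ids = next_project_ids_with_duplicates(services, after_id: cursor, limit: batch_size)
    break if ids.empty?

    batches << ids
    cursor = ids.last # the next page starts after the last ID seen
  end
  batches
end

# Sample data (assumed): projects 10, 12 and 13 have duplicates, project 11 does not.
services = [
  Service.new(1, 10, 'JiraService'),
  Service.new(2, 10, 'JiraService'),
  Service.new(3, 11, 'SlackService'),
  Service.new(4, 12, 'SlackService'),
  Service.new(5, 12, 'SlackService'),
  Service.new(6, 13, 'JiraService'),
  Service.new(7, 13, 'JiraService')
]

batches = batches_to_queue(services, batch_size: 2)
```

Using the last project ID of each page as the cursor avoids `OFFSET`-style scans, which is the usual reason to prefer keyset pagination on large tables.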

Example query plans on staging:

Migration output:

$ rake db:migrate:up VERSION=20201207165956
== 20201207165956 RemoveDuplicateServices: migrating ==========================
== 20201207165956 RemoveDuplicateServices: migrated (0.0946s) =================
$ rake db:migrate:down VERSION=20201207165956
== 20201207165956 RemoveDuplicateServices: reverting ==========================
== 20201207165956 RemoveDuplicateServices: reverted (0.0000s) =================

Background migration: lib/gitlab/background_migration/remove_duplicate_services.rb

This background migration deletes duplicate records for each batch of project IDs.
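The per-batch deduplication can be sketched as follows. This is only an in-memory illustration under assumptions: `ServiceRecord` and `duplicate_ids_for` are hypothetical names, and "keep the first record encountered" stands in for whatever record the database's default order happens to return first.

```ruby
# Hypothetical in-memory stand-in for rows of the services table.
ServiceRecord = Struct.new(:id, :project_id, :type)

# Returns the IDs of the records to delete for the given batch of project IDs:
# for every (project_id, type) pair, every record except the first one seen.
def duplicate_ids_for(records, project_ids)
  records
    .select { |r| project_ids.include?(r.project_id) }
    .group_by { |r| [r.project_id, r.type] }
    .values
    .flat_map { |rows| rows.drop(1).map(&:id) }
end

# Sample data (assumed): project 10 has two Jira services, project 12 has
# three Slack services, project 11 has no duplicates.
records = [
  ServiceRecord.new(1, 10, 'JiraService'),
  ServiceRecord.new(2, 10, 'JiraService'),
  ServiceRecord.new(3, 10, 'SlackService'),
  ServiceRecord.new(4, 11, 'SlackService'),
  ServiceRecord.new(5, 12, 'SlackService'),
  ServiceRecord.new(6, 12, 'SlackService'),
  ServiceRecord.new(7, 12, 'SlackService')
]

ids_to_delete = duplicate_ids_for(records, [10, 12])
```

Different service types on the same project are left alone; only repeated records of the same type are selected for deletion.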

Example query plans on staging:

Expected runtime

On production (via #database-lab):

On staging (via psql):

  • 959831 projects affected, resulting in 192 batches / ~6.4 hours total runtime.
  • ~960050+ duplicate records to be deleted (I couldn't find out the exact number because the query times out).
  • Most of these were a side-effect of testing instance integrations: gitlab-com/gl-infra/production#1651 (closed).
    If we exclude the inherited records (and clean them up separately), we'd have:
    • 41 projects affected, resulting in 1 batch / < 2 minutes total runtime.
    • ~249+ duplicate records to be deleted.
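
The batch counts and runtimes above can be reproduced with a quick calculation. The batch size of 5,000 project IDs and the 2-minute interval between jobs are assumptions inferred from the reported figures, not values stated in this description, and `estimate` is a hypothetical helper.

```ruby
# Assumed values: inferred from the reported figures, not stated in the MR.
BATCH_SIZE = 5_000
DELAY_MINUTES = 2

# Estimates the number of batches and the total scheduled runtime in minutes.
def estimate(projects, batch_size: BATCH_SIZE, delay_minutes: DELAY_MINUTES)
  batches = (projects.to_f / batch_size).ceil
  { batches: batches, minutes: batches * delay_minutes }
end

all_projects = estimate(959_831)  # every project with duplicates
without_inherited = estimate(41)  # excluding inherited records
```

Under these assumptions, 959831 projects give 192 batches and 384 minutes (6.4 hours), while 41 projects fit in a single batch with an upper bound of 2 minutes.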

Previous timings from first version

Post-deployment migration: db/post_migrate/20201207165956_remove_duplicate_services.rb

This post-deployment migration uses keyset pagination to bulk-queue background jobs for all projects with services.

Example query plans on staging:

Migration output:

$ rake db:migrate:up VERSION=20201207165956
== 20201207165956 RemoveDuplicateServices: migrating ==========================
== 20201207165956 RemoveDuplicateServices: migrated (0.0946s) =================
$ rake db:migrate:down VERSION=20201207165956
== 20201207165956 RemoveDuplicateServices: reverting ==========================
== 20201207165956 RemoveDuplicateServices: reverted (0.0000s) =================

Background migration: lib/gitlab/background_migration/remove_duplicate_services.rb

This background migration processes each batch of projects.

Example query plans on staging:

Expected runtime:

  • We have around 720'000 projects with services on gitlab.com, but it's not known how many of these have duplicate services.
    • A previous production query 9 months ago resulted in 669'991 records which we expect to be deleted.
  • With a batch size of 500 that would be at maximum ~150 batches with ~5 hours total runtime.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

Related to #290008 (closed)

Edited by Markus Koller
