Skip to content

Remove unnecessary onboarded namespaces from onboarding_progresses table

Doug Stull requested to merge 392792-reinstate-changes-from-112229 into master

What does this MR do and why?

Re-implementation of !112229 (merged) that was rolled back via !112619 (merged) due to a timeout issue seen in gitlab-com/gl-infra/production#8436 (closed)

From https://docs.gitlab.com/ee/development/database_review.html#preparation-when-adding-data-migrations:

If the migration itself is not reversible, details of how data changes could be reverted in the event of an incident. For example, in the case of a migration that deletes records (an operation that most of the times is not automatically reversible), how could the deleted records be recovered.

No reversion of data is necessary, we are holding on !111324 (merged) until this is successful.

I considered a background batched migration, but from a suggestion - I'm trying for a foreground batched migration

Migration

migration/rollback
❯ bin/rails db:migrate
main: == 20230221214519 RemoveIncorrectlyOnboardedNamespacesFromOnboardingProgress: migrating
main: == 20230221214519 RemoveIncorrectlyOnboardedNamespacesFromOnboardingProgress: migrated (0.0991s)

ci: == 20230221214519 RemoveIncorrectlyOnboardedNamespacesFromOnboardingProgress: migrating
ci: -- The migration is skipped since it modifies the schemas: [:gitlab_main].
ci: -- This database can only apply migrations in one of the following schemas: [:gitlab_ci, :gitlab_shared, :gitlab_internal].
ci: == 20230221214519 RemoveIncorrectlyOnboardedNamespacesFromOnboardingProgress: migrated (0.0607s)

❯ be rails db:rollback:main VERSION=20230221214519
main: == 20230221214519 RemoveIncorrectlyOnboardedNamespacesFromOnboardingProgress: reverting
main: == 20230221214519 RemoveIncorrectlyOnboardedNamespacesFromOnboardingProgress: reverted (0.0267s)

Query analysis

Notes:

  • Offset some of the numbers to simulate cold runs and overlapped on others to simulate real running of outer batch and then inner project query.
  • batch sizes I've tried tuning many times from higher(1000, 5000, 10000) to lower (100, 200). From the database testing job, the current seems most optimal for current query design.
full delete
DELETE
FROM "onboarding_progresses"
WHERE "onboarding_progresses"."id" >= 1
  AND "onboarding_progresses"."id"
    < 501
  AND "onboarding_progresses"."namespace_id" NOT IN (SELECT "projects"."namespace_id"
                                                     FROM "projects"
                                                     WHERE "projects"."name" IN ('Learn GitLab', 'Learn GitLab - Ultimate trial')
                                                       AND "projects"."namespace_id" IN
                                                           (SELECT "onboarding_progresses"."namespace_id"
                                                            FROM "onboarding_progresses"
                                                            WHERE "onboarding_progresses"."id" >= 1
                                                              AND "onboarding_progresses"."id"
                                                                < 501))
outer batch query
SELECT "onboarding_progresses"."id"
FROM "onboarding_progresses"
WHERE "onboarding_progresses"."id" >= 501
ORDER BY "onboarding_progresses"."id" ASC
LIMIT 500 OFFSET 501
Inner project query
SELECT "projects"."namespace_id"
FROM "projects"
WHERE "projects"."name" IN ('Learn GitLab', 'Learn GitLab - Ultimate trial')
  AND "projects"."namespace_id" IN (SELECT "onboarding_progresses"."namespace_id"
                                    FROM "onboarding_progresses"
                                    WHERE "onboarding_progresses"."id" >= 501
                                      AND "onboarding_progresses"."id" < 1001)

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #392792 (closed)

Edited by Doug Stull

Merge request reports