Jon Jenkins requested to merge 370640-migration-squasher into master Nov 30, 2022

What Does This Do?

This MR adds a rake task to facilitate the complex, tedious, and error-prone process of squashing migrations. Squashing is normally done when GitLab increments its major version: every migration before the major version release is squashed.

Considerations

When squashing migrations in our code base, there are multiple complicating factors:

In addition to migrations in the 'db/migrate' folder, there are migrations in the 'db/post_migrate' folder, as well as migrations in the 'ee' folder. This turns out to be a minor issue, because our migration context has all paths specified and all migrations are interleaved chronologically according to version.
Migration files must be deleted of course, but some migrations have corresponding rspec files that must also be identified and removed. This is not very difficult, as the spec files also contain the version number which can be matched easily.
Background (non-batched) migrations use custom classes that exist only for the migration. These classes must be found and their files removed, along with the corresponding rspec files. This problem is more complex: in some cases the class name is passed directly to the background migration helper in the migration file, and in other cases it is first assigned to a constant. This is extremely difficult to automate using simple text matching techniques. (Note: we're going to defer this task to a separate issue, which will entail simply finding orphaned migration classes.)
~~Since migration squashing must be done in groups of a few hundred, exactly which migrations will be squashed in which MR must be calculated and tracked by hand.~~ Given the new procedure I have outlined below, there is no reason to limit the squashes to a few hundred at a time.
The schema dump includes a special table, ar_internal_metadata, which is actually created when the database itself is created. This can be solved by extending the regular schema dumping task and setting a config variable. We don't need to do schema dumping anymore.
The db/schema_migrations folder is automatically updated as the result of a migration - this is not desirable during a migration squash and makes for a tedious git workflow when preparing an MR that squashes migrations, as multiple migrations run during the course of the squashing and the schema_migrations folder must be reset.
Due to the sheer volume of files deleted, manually including these all in a git commit is time-consuming.
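To illustrate the version-matching considerations above, here is a minimal sketch of how migration files from several folders could be interleaved by version and how spec files could be paired with them. The method names (version_of, interleave, specs_for) and the example paths are hypothetical and are not taken from this MR; the sketch only assumes that file names contain a 14-digit timestamp version followed by an underscore.

```ruby
# Hypothetical helpers for illustration only; not the rake task's actual code.

# Extract the 14-digit timestamp version from a file name, or nil if absent.
def version_of(path)
  match = File.basename(path).match(/(\d{14})_/)
  match && match[1]
end

# Interleave migrations from several folders in chronological (version) order.
# Fixed-width timestamps sort correctly as plain strings.
def interleave(migration_paths)
  migration_paths.select { |path| version_of(path) }
                 .sort_by { |path| version_of(path) }
end

# Pick the spec files whose embedded version matches a migration being squashed.
def specs_for(migration_paths, spec_paths)
  versions = migration_paths.map { |path| version_of(path) }.compact
  spec_paths.select { |spec| versions.include?(version_of(spec)) }
end
```

Sorting on the version string alone means the folder a migration lives in does not affect its position, which mirrors the chronological interleaving described above.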

Documentation for the old process (left in for posterity)

The Old Process

These rake tasks make it simple for any developer to prepare migration squashes.

First, plan all of the steps of the squashing process. Determine the final version to squash to; in this case we're squashing up to version 20220517144749. From a clean git status, run bundle exec rake gitlab:db:squash:plan VERSION=20220517144749. This creates the file config/squash_plan.json, which should be committed to the GitLab repo. Optionally, inspect this file to review the plan.
Run bundle exec rake gitlab:db:squash:step. This:
1. Drops and recreates the database.
2. Analyzes the migration files involved in migrating up to the version specified in the current step of the plan, including tracing down all class names passed to background migration helpers and their associated spec files.
3. Migrates up to the requisite version and performs a schema dump.
4. Updates the migration file of the version we've migrated to with code to load and execute the schema dump.
5. Fixes db/schema_migrations to a default state.
6. Does all of the requisite git staging.
Run bundle exec rake gitlab:db:squash:finalize. This performs another cycle of drop -> create -> migrate, running a full migration that starts at the new db/init_structure.sql and works up to the latest migration. If this all succeeds, the current step in the plan is marked as 'complete'. (This change should be staged with the current pending commit.)
Examine the output of git status to verify that all looks well. If so, run git commit!

Repeat for each step after each MR is merged upstream.

Questions/Potential concerns

Finding associated classes for background migrations is done using RuboCop's own NodePattern matching library, which essentially matches nodes in a Ruby source AST. This works by taking a migration file and parsing the source into an AST, then recursively visiting every node and attempting to match relevant patterns that allow us to find the name of the class in question. This section of the code is dense, and in general the NodePattern DSL is hard to read. My concerns around this code involve future maintainability; however, this is a cleaner solution than could be accomplished using simple pattern matching.
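As a rough illustration of the recursive AST walk described above, the sketch below swaps in Ruby's standard-library Ripper parser instead of RuboCop's NodePattern, so it is not the code in this MR. The helper names listed in HELPERS and the method names are assumptions chosen for the example. It handles both cases mentioned earlier: a class name passed directly as a string, and a class name assigned to a constant first.

```ruby
require 'ripper'

# Assumed helper names, for illustration only.
HELPERS = %w[migrate_in queue_background_migration_jobs_by_range_at_intervals].freeze

# Visit every array node in the s-expression, depth first.
def each_node(node, &block)
  return unless node.is_a?(Array)

  yield node
  node.each { |child| each_node(child, &block) }
end

# Map constant names to string values for assignments like MIGRATION = 'Foo'.
def constant_strings(sexp)
  table = {}
  each_node(sexp) do |node|
    next unless node[0] == :assign

    target, value = node[1], node[2]
    next unless target.is_a?(Array) && target[0] == :var_field
    next unless target[1].is_a?(Array) && target[1][0] == :@const
    next unless value.is_a?(Array) && value[0] == :string_literal

    each_node(value) { |n| table[target[1][1]] = n[1] if n[0] == :@tstring_content }
  end
  table
end

# True for `helper(args)` and `helper args` where helper is in HELPERS.
def helper_call?(node)
  case node[0]
  when :method_add_arg
    callee = node[1]
    callee.is_a?(Array) && callee[0] == :fcall && HELPERS.include?(callee[1][1])
  when :command
    node[1].is_a?(Array) && node[1][0] == :@ident && HELPERS.include?(node[1][1])
  else
    false
  end
end

# Return class names passed to a helper, directly or through a constant.
def background_migration_classes(source)
  sexp = Ripper.sexp(source)
  constants = constant_strings(sexp)
  found = []

  each_node(sexp) do |node|
    next unless helper_call?(node)

    each_node(node) do |arg|
      case arg[0]
      when :@tstring_content
        # Only keep strings that look like a class name.
        found << arg[1] if arg[1].match?(/\A[A-Z]\w*\z/)
      when :@const
        found << constants[arg[1]] if constants.key?(arg[1])
      end
    end
  end
  found.uniq
end
```

Even in this simplified form, the constant lookup shows why plain text matching falls short: the class name and the helper call can sit on different lines with no textual link other than the constant's name.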

The other issues/questions I will highlight in comments.

The Process

Check out master
Run the rake task, passing the version containing the migrations you want to squash as an argument: bundle exec rake gitlab:db:squash\[15-11-stable-ee\] (I use zsh, which requires the brackets to be escaped). This example assumes that I want to squash all migrations in 15-11-stable-ee.
Run git commit

Caveats

I am not currently attempting to find and remove orphaned migration classes. We will tackle that in #416132.

I have evaluated the MR acceptance checklist for this MR.

Related to #370640 (closed)

Edited Jun 22, 2023 by Jon Jenkins

Rake task to squash migrations