Make it easier to schedule and clean up background migrations
In 9.4 we are going to ship background migrations.
We would like to use this technique to migrate pipeline stages, but I wonder if we should improve the mechanism before doing it.
New proposal
- Make background migrations self-contained within classes, like `MyBackgroundMigration`

  What I mean by "self contained" is to design a simple DSL that makes it possible to define an isolated SQL query inside a migration class, to avoid defining a query inside a regular migration and then scheduling the background migration from there. With this approach a background migration would have all the data it needs to be responsible for scheduling itself and cleaning up after itself.
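A minimal sketch of what such a DSL could look like. Every name here (`BackgroundMigration::DSL`, `scope`, `MigrateBuildStages`) is invented for illustration, not an existing API:

```ruby
# Hypothetical DSL: the migration class itself declares the query used to
# find rows that still need migrating, so scheduling and cleanup can be
# derived from the class alone. Every name below is an assumption.

module BackgroundMigration
  module DSL
    # `scope { ... }` stores the query block; `scope` with no block reads it.
    def scope(&block)
      block ? (@scope = block) : @scope
    end
  end
end

class MigrateBuildStages
  extend BackgroundMigration::DSL

  # The isolated SQL query that defines which rows still need work.
  scope { "SELECT id FROM ci_builds WHERE stage_id IS NULL" }
end
```

Because the class owns its query, everything below (`schedule`, `cleanup!`, `progress`) can be generic code that only consults `scope`.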
- Implement `MyMigration.schedule(1.week)`

  Currently, in order to schedule a background migration, one needs to ask someone who has access to production to count the rows that need to be migrated. Then we need to manually set a delay and calculate the time it will take to migrate everything. This is a fragile and error-prone mechanism. But all the data we need is already available, so it should be possible to automatically calculate a batch size and a delay, making it as simple as `MyMigration.schedule(1.week)`, where `1.week` is the maximum migration time we allow. We can also raise an error if it is not possible to finish the migration within the specified time, which is not possible currently and can lead to problems when someone miscalculates the batch size and the delay.
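As a rough sketch of the idea: `schedule` could derive the batch size from the row count and the allowed total time, and raise when no feasible plan exists. The minimum interval, the stubbed row count, and the per-batch safety limit below are all invented numbers:

```ruby
# Sketch of automatic batching. MIN_INTERVAL, MAX_BATCH_SIZE and the
# stubbed row count are hypothetical values, not real GitLab constants.

class MyMigration
  MIN_INTERVAL = 2 * 60      # seconds between two consecutive batches
  MAX_BATCH_SIZE = 10_000    # largest batch we consider safe for the DB

  # In a real migration this would run the migration's own SQL query;
  # stubbed here so the sketch stays self-contained.
  def self.rows_to_migrate
    100_000
  end

  # How many batches fit into `max_duration`, and how big each must be.
  def self.plan(max_duration)
    batches = [max_duration / MIN_INTERVAL, 1].max
    { batch_size: (rows_to_migrate.to_f / batches).ceil, delay: MIN_INTERVAL }
  end

  def self.schedule(max_duration)
    plan(max_duration).tap do |p|
      # Fail loudly instead of silently overloading the database.
      raise "cannot finish in #{max_duration}s" if p[:batch_size] > MAX_BATCH_SIZE
    end
  end
end
```

With `1.week` being 604,800 seconds, this would split 100,000 rows into batches of 20 every two minutes, while a deadline of ten minutes would raise instead of producing oversized batches.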
- Implement `MyMigration.cleanup!`

  Once the migration class owns the query / code responsible for getting rows that need to be migrated, we can clean up more easily. This also allows us to implement additional fail-safe mechanisms and recover from race conditions related to using import/export, as described in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/18448/diffs#0c2ce9344ef3941ff04aaaefd5fcb7c0689ff1ed_139_139.
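Given a query owned by the class, `cleanup!` could simply drain the remaining rows inline and verify nothing is left. A sketch where an in-memory list stands in for the migration's own query (all names hypothetical):

```ruby
# Sketch of cleanup!: process whatever is left synchronously, then assert
# that nothing remains. The @remaining array stands in for re-running the
# migration's SQL query against the database.

class MyMigration
  @remaining = [1, 2, 3]
  @migrated = []

  class << self
    attr_reader :migrated

    def remaining_ids
      @remaining
    end

    # Migrate a single row inline (stubbed).
    def perform(id)
      @migrated << id
    end

    def cleanup!
      remaining_ids.each { |id| perform(id) }
      @remaining = []
      # Fail-safe: refuse to report success if rows were left behind.
      raise "rows left behind" unless remaining_ids.empty?
      true
    end
  end
end
```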
- Implement `MyMigration.progress` and `MyMigration.finished?`

  Currently, in order to see if a background migration is finished, we need to invoke a few complex commands in a Rails console to check 1. the Sidekiq queue for this migration, 2. the Sidekiq scheduled sets for this migration. Adding these methods would make the life of production engineers / administrators much easier, even if they are only ever used from a `rails console`.
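With the row query in the class, `progress` and `finished?` reduce to counting rows rather than inspecting Sidekiq. A sketch with stubbed counts (the numbers are invented):

```ruby
# Sketch of progress reporting. total_rows / remaining_rows are stubbed;
# in reality both would come from the migration's own query.

class MyMigration
  def self.total_rows
    100
  end

  def self.remaining_rows
    25
  end

  # Fraction of rows already migrated, between 0.0 and 1.0.
  def self.progress
    1.0 - remaining_rows.to_f / total_rows
  end

  def self.finished?
    remaining_rows.zero?
  end
end
```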
Old proposal
Currently background migrations require scheduling jobs in bulk from a regular / post-deployment migration. If we would like to migrate a lot of data, like adding a stage id reference to `ci_builds`, we would immediately schedule a few million asynchronous jobs, which is no better than the post-deployment migration we already implemented, which resulted in downtime caused by DB I/O write spikes.
Make it possible to recover from a Redis/Sidekiq crash.
There is no good way to schedule background migrations in batches while giving the database room to take a breath between them. Scheduling the next batch from within a background migration may be considered a fragile approach that is not fault tolerant and can result in data integrity problems.
Make it easier to clean it up.
The cleanup strategy now is to create a new migration in the next release that cleans up after the background migration by immediately executing all remaining background migrations. The docs describe the following scenario:
- Release A:
  - Create a migration class that performs the migration for a row with a given ID.
  - Deploy the code for this release; this should include some code that will schedule jobs for newly created data (e.g. using an `after_create` hook).
  - Schedule jobs for all existing rows in a post-deployment migration.
- Release B:
  - Deploy code so that the application starts using the new column and stops scheduling jobs for newly created data.
  - In a post-deployment migration you'll need to ensure no jobs remain. To do so you can use `Gitlab::BackgroundMigration.steal` to process any remaining jobs before continuing.
  - Remove the old column.
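The "schedule jobs for all existing rows" step above is roughly a loop like the following. Here `perform_in` is a stand-in for enqueueing a delayed Sidekiq job (so the sketch runs standalone), and the batch size and interval are invented numbers:

```ruby
# Sketch of bulk scheduling in a post-deployment migration. The queue is
# stubbed with an array; real code would enqueue delayed Sidekiq jobs.

SCHEDULED = []

# Stand-in for enqueueing a job to run `delay` seconds from now.
def perform_in(delay, migration, ids)
  SCHEDULED << [delay, migration, ids]
end

# Spread batches out over time so the database gets a break between them.
def schedule_in_batches(ids, batch_size:, interval:)
  ids.each_slice(batch_size).with_index do |batch, index|
    perform_in(index * interval, 'MyMigration', batch)
  end
end

schedule_in_batches((1..10).to_a, batch_size: 4, interval: 300)
```

Ten rows with a batch size of 4 produce three delayed jobs at 0, 300, and 600 seconds; the delay and batch size are exactly the numbers the "New proposal" above wants computed automatically instead of hand-picked.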
In fact, this scenario can easily expand to one more release:
- Release A:
  - Create a migration class that performs the migration for a row with a given ID.
  - Deploy the code for this release; this should include some code that will schedule jobs for newly created data (e.g. using an `after_create` hook).
  - Schedule jobs for all existing rows in a post-deployment migration.
- Release B:
  - In a post-deployment migration you'll need to ensure no jobs remain. To do so you can use `Gitlab::BackgroundMigration.steal` to process any remaining jobs before continuing.
- Release C:
  - Deploy code so that the application starts using the new column and does not schedule migrations for new data. All data needs to be migrated already at this point.
  - Remove the old column.
Shipping a migration that needs follow-ups in three subsequent releases is a little error-prone. It is easy to miss something.
Recognize the impact on customers updating across multiple versions
Shipping the `ci_builds.stage_id` update migration as a background migration means that customers / users updating across multiple versions will, of course, still need to update with downtime, and the update will take long hours because the `ci_builds.stage_id` migration becomes a foreground migration in that case. If we plan to implement a delay between batches, as per point 3 in https://gitlab.com/gitlab-org/gitlab-ce/issues/34151:

> Every new batch will be scheduled 5 minutes after the previous batch was scheduled

then we will need a mechanism that skips the delays when this runs as a foreground migration during an update across multiple versions.
I would love to hear @yorickpeterse's and @ayufan's thoughts on this. Thanks in advance!