Make it easier to schedule and clean up background migrations

In 9.4 we are going to ship background migrations.

We would like to use this technique to migrate pipeline stages, but I wonder if we should improve the mechanism before doing it.

## New proposal

  1. Make background migrations self-contained within classes, like MyBackgroundMigration

    What I mean by "self-contained" is designing a simple DSL that makes it possible to define an isolated SQL query inside a migration class, avoiding the need to define a query inside a regular migration and then schedule the background migration from there. With this approach a background migration would have all the data it needs to be responsible for scheduling itself and cleaning up after itself (see the sketch after this list).

  2. Implement MyMigration.schedule(1.week)

    Currently, in order to schedule a background migration, one needs to ask someone who has access to production to count the rows that need to be migrated. Then we need to manually set a delay and calculate the time the migration will take. This is a fragile and error-prone mechanism. But all the data needed to automatically calculate a batch size and a delay time is already available, so it should be possible to make this as simple as MyMigration.schedule(1.week), where 1.week is the maximum migration time we allow.

    We can also raise an error if it is not possible to finish the migration within the specified time, which is not possible currently and can lead to problems when someone miscalculates the batch size and the delay time.

  3. Implement MyMigration.cleanup!

    Once the query / code responsible for getting the rows that need to be migrated lives in the migration class, cleaning up becomes much easier. This also allows us to implement additional fail-safe mechanisms and recover from race conditions related to using import/export, as described in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/18448/diffs#0c2ce9344ef3941ff04aaaefd5fcb7c0689ff1ed_139_139.

  4. Implement MyMigration.progress and MyMigration.finished?

    Currently, in order to see whether a background migration has finished, we need to invoke a few complex commands in a Rails console to check 1. the Sidekiq queue for this migration, 2. the Sidekiq scheduled sets for this migration. Adding these methods would make the life of production engineers / administrators much easier, even if only used from a Rails console.
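
A minimal sketch of how these four pieces could fit together, assuming a hypothetical `Gitlab::BackgroundMigration::Base` class with a `scope` DSL. None of these names, constants, or signatures exist today; they only illustrate the points above:

```ruby
module Gitlab
  module BackgroundMigration
    # Hypothetical base class for self-contained background migrations.
    class Base
      BATCH_SIZE = 10_000   # assumed default batch size
      MIN_DELAY  = 1.minute # assumed safety floor between two batches

      # Point 1: the DSL. A subclass declares the relation of rows that
      # still need migrating, so the class is self-contained.
      def self.scope(&block)
        block ? @scope = block : @scope
      end

      def self.remaining
        scope.call
      end

      # Point 2: derive the number of batches and the delay between them
      # from the row count and the allowed duration, failing fast when
      # the deadline cannot be met.
      def self.schedule(max_duration)
        return if finished?

        batches = (remaining.count / BATCH_SIZE.to_f).ceil
        delay   = max_duration / batches

        raise "cannot finish within #{max_duration.inspect}" if delay < MIN_DELAY

        remaining.each_batch(of: BATCH_SIZE) do |batch, index|
          range = batch.pluck('MIN(id)', 'MAX(id)').first
          BackgroundMigrationWorker.perform_in(delay * index, name.demodulize, range)
        end
      end

      # Point 3: whatever the scope still returns is, by definition, the
      # leftover work, so cleanup can simply process it synchronously.
      def self.cleanup!
        remaining.each_batch(of: BATCH_SIZE) do |batch, _index|
          new.perform(*batch.pluck('MIN(id)', 'MAX(id)').first)
        end
      end

      # Point 4: progress and completion derived from the same scope.
      # `total` is assumed to be defined by the subclass.
      def self.progress
        1.0 - remaining.count.to_f / total.count
      end

      def self.finished?
        !remaining.exists?
      end
    end
  end
end
```

With that in place, the pipeline stages migration mentioned above could be expressed roughly like this (the class name is hypothetical):

```ruby
class MigrateBuildStageIdReference < Gitlab::BackgroundMigration::Base
  scope { Ci::Build.where(stage_id: nil) }

  def self.total
    Ci::Build.all
  end

  def perform(start_id, stop_id)
    # the actual UPDATE for the given ID range goes here
  end
end

MigrateBuildStageIdReference.schedule(1.week)
```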

/cc @smcgivern @yorickpeterse

## Old proposal

### Make it possible to migrate large sets of data.

Currently, background migrations require scheduling jobs in bulk from a regular / post-deployment migration. If we wanted to migrate a lot of data, like adding a stage_id reference to ci_builds, we would immediately schedule a few million asynchronous jobs, which is no better than the post-deployment migration we already implemented, which resulted in downtime caused by DB I/O write spikes.

### Make it possible to recover from a Redis/Sidekiq crash.

There is no good way to schedule background migrations in batches while giving the database room to breathe between them. Scheduling the next batch from within a background migration may be considered a fragile approach that is not fault tolerant and can result in data integrity problems.

### Make it easier to clean up.

The cleanup strategy now is to create a new migration that goes into the next release and cleans up after the background migration by immediately executing all remaining background migrations. The docs describe the following scenario:

  1. Release A:
     1. Create a migration class that performs the migration for a row with a given ID.
     2. Deploy the code for this release. It should include some code that schedules jobs for newly created data (e.g. using an after_create hook).
     3. Schedule jobs for all existing rows in a post-deployment migration.
  2. Release B:
     1. Deploy code so that the application starts using the new column and stops scheduling jobs for newly created data.
     2. In a post-deployment migration, ensure no jobs remain. To do so you can use Gitlab::BackgroundMigration.steal to process any remaining jobs before continuing.
     3. Remove the old column.
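
For reference, the Release B cleanup could be a post-deployment migration like the sketch below. `Gitlab::BackgroundMigration.steal` is the existing helper mentioned above; the migration class name `PopulateStageIdOnCiBuilds` is a hypothetical example:

```ruby
# db/post_migrate/..._cleanup_populate_stage_id_on_ci_builds.rb
class CleanupPopulateStageIdOnCiBuilds < ActiveRecord::Migration
  DOWNTIME = false

  def up
    # Drain any jobs still sitting in the Sidekiq queue or in the
    # scheduled sets before the old column is removed.
    Gitlab::BackgroundMigration.steal('PopulateStageIdOnCiBuilds')
  end

  def down
    # No-op: the processed background migrations cannot be un-run here.
  end
end
```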

In fact, this scenario can easily expand to one more release:

  1. Release A:
     1. Create a migration class that performs the migration for a row with a given ID.
     2. Deploy the code for this release. It should include some code that schedules jobs for newly created data (e.g. using an after_create hook).
     3. Schedule jobs for all existing rows in a post-deployment migration.
  2. Release B:
     1. In a post-deployment migration, ensure no jobs remain. To do so you can use Gitlab::BackgroundMigration.steal to process any remaining jobs before continuing.
  3. Release C:
     1. Deploy code so that the application starts using the new column and no longer schedules migrations for new data. All data needs to be migrated already at this point.
     2. Remove the old column.

Shipping a migration that needs follow-ups in three subsequent releases is a little error-prone. It is easy to miss something.

### Recognize the impact on customers updating across multiple versions

Shipping the ci_builds.stage_id update as a background migration means, of course, that customers / users will still need to update with downtime, but the update will take long hours in the case where the ci_builds.stage_id migration has to run as a foreground migration. If we plan to implement a delay between batches, as per point 3 in https://gitlab.com/gitlab-org/gitlab-ce/issues/34151:

> Every new batch will be scheduled 5 minutes after the previous batch was scheduled

then we will need a mechanism that skips these delays when the migration runs as a foreground migration during a multi-version update.
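
A hypothetical shape for such a mechanism: route every batch through one helper and drop the delay for inline runs. Both `schedule_batches` and the `foreground` flag are assumptions for illustration, not existing APIs:

```ruby
BATCH_DELAY = 5.minutes # the delay from point 3 in issue 34151

# Hypothetical helper: batches keep their spacing when processed by
# Sidekiq, but a foreground run during a multi-version update executes
# them back to back, skipping the delays entirely.
def schedule_batches(ranges, foreground: false)
  ranges.each_with_index do |range, index|
    if foreground
      Gitlab::BackgroundMigration::MigrateBuildStageIdReference.new.perform(*range)
    else
      BackgroundMigrationWorker.perform_in(BATCH_DELAY * index, 'MigrateBuildStageIdReference', range)
    end
  end
end
```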

I would love to hear @yorickpeterse's and @ayufan's thoughts on this. Thanks in advance! 💛
