Docs: When to steal background migrations
Question
In https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/22522#note_114997784:
Code checking for store = NULL will be removed.
This is fine because a background migration multiple releases ago backfilled store, and new records do not set it to NULL since before that.
In this same release, we will steal the background migration to make sure the data migration is complete.
Should the steal be done pre-deploy or post-deploy?
Background
There is an example in the docs https://gitlab.com/gitlab-org/gitlab-ce/blob/b0be58a1/doc/development/background_migrations.md#L139:
In a post-deployment migration you’ll need to ensure no jobs remain. Use Gitlab::BackgroundMigration.steal to process any remaining jobs in Sidekiq. Reschedule the migration to be run directly (i.e. not through Sidekiq) on any rows that weren’t migrated by Sidekiq. This can happen if, for instance, Sidekiq received a SIGKILL, or if a particular batch failed enough times to be marked as dead.
In that example, dropping the column must be done post-deploy, or else there will be many errors. But the steal itself doesn't have to be post-deploy.
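To make the two steps from the docs quote concrete (process whatever is still queued, then run the migration directly on rows that were never migrated), here is a toy Ruby model. It is only a sketch and does not use GitLab's code: `Row`, `ToyBackfill`, and `toy_steal` are hypothetical stand-ins, and the assumption that each job is an id range is mine, not taken from the MR.

```ruby
# Toy stand-ins only. None of these are GitLab classes or helpers.
Row = Struct.new(:id, :store)

# Hypothetical background migration: backfills `store` for an id range.
class ToyBackfill
  def perform(rows, first_id, last_id)
    rows.each do |row|
      next unless row.id.between?(first_id, last_id)

      row.store ||= 1 # only fill rows that are still NULL (nil here)
    end
  end
end

# Toy "steal": run every job that is still queued inline,
# instead of waiting for a worker to pick it up.
def toy_steal(queue, rows)
  until queue.empty?
    first_id, last_id = queue.shift
    ToyBackfill.new.perform(rows, first_id, last_id)
  end
end

rows = (1..6).map { |id| Row.new(id, nil) }

# Assumed starting state:
# - ids 1..2 were already backfilled by a worker,
# - the job for ids 3..4 is still sitting in the queue,
# - the job for ids 5..6 was killed and is no longer queued.
rows[0].store = 1
rows[1].store = 1
queue = [[3, 4]]

# Step 1: steal whatever is still queued.
toy_steal(queue, rows)
missed_ids = rows.select { |row| row.store.nil? }.map(&:id)

# Step 2: run the migration directly on rows no job ever reached.
unless missed_ids.empty?
  ToyBackfill.new.perform(rows, missed_ids.min, missed_ids.max)
end
```

In this model the steal alone leaves ids 5 and 6 unmigrated, which is why the docs pair it with a direct run. Whether both steps happen before or after the new code is deployed is the question this issue raises.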
steal only does work if background migrations:
- are still running
  - due to an impatient admin going against zero-downtime recommendations (not our fault, so we can mostly ignore this).
  - due to a bug making the BG migration run longer than expected (exceptional case).
- were killed (exceptional case).
The tradeoff
Depending on what the data migration does, and how much work is left:
- the pre-deploy steal avoids errors or invalid data, but the migration could run for e.g. a few days.
- the post-deploy steal allows errors or invalid data for an indeterminate period of time, e.g. a few days.
Failure modes
- Pre-deploy migration taking over an hour: Admin can cancel, roll back, downgrade, and investigate.
- Post-deploy, while the steal is working, the app may start raising errors: Admin will roll back, downgrade, and investigate.
- Post-deploy, while the steal is working, the app may start acting on invalid data without raising: The admin may not find out.
As a conservative sysadmin, I would prefer the steal to be done pre-deploy whenever deploying before it finishes could lead to errors or invalid data.
Proposal
Update the docs to explain why/when you would choose to steal pre-deploy vs post-deploy.
Sidenote
There is another option that avoids "the tradeoff": a 3-stage release. I think this is unnecessarily conservative because the steal only does work in exceptional cases, but we could mention it in the docs too.