Improve reversibility of rename_column_concurrently and cleanup_concurrent_column_rename
Problem 1
Our method for changing column type is reversible as a migration+post-migration pair, but the migration by itself is irreversible (going forward then back) and similarly, the post-migration is irreversible backwards (going back then forward). This makes our deploys more fragile.
I.e. If we are changing a column's type, but another unrelated migration dies before the post-migrations are run, and we attempt to rollback, we will get an error like this.
Problem 2
We test rollbacks in CE and EE with this section in gitlab-ci.yml
:
.db-rollback: &db-rollback
<<: *dedicated-runner
<<: *except-docs
<<: *pull-cache
stage: test
script:
- bundle exec rake db:rollback STEP=120
- bundle exec rake db:migrate
It happened today that this job rolled back 120 steps, including this down
method in a post-migration:
rename_column_concurrently :web_hooks, :job_events, :build_events
But 120 steps was not far enough to run the down
method of the corresponding migration:
cleanup_concurrent_column_rename :web_hooks, :job_events, :build_events
So during bundle exec rake db:migrate
, it got the error:
bundle exec rake db:migrate
rake aborted!
StandardError: An error has occurred, all later migrations canceled:
PG::UndefinedObject: ERROR: trigger "trigger_688beaaec90d" for table "web_hooks" does not exist
: DROP TRIGGER trigger_688beaaec90d ON web_hooks
The error occurred during the up
method of the post-migration:
cleanup_concurrent_column_rename :web_hooks, :build_events, :job_events
As you can see, build_events
and job_events
are flipped, since the up
method of this post-migration was never intended to cleanup its own down
method.
So the db-rollback
job can fail just because it happened to rollback to in-between a migration and its corresponding post-migration.
Solutions
It would be ideal if we could make each migration or post-migration individually reversible. For example, cleanup_concurrent_column_rename
could check whether it needs to clean up the result of rename_column_concurrently :table, :foo, :bar
, or rename_column_concurrently :table, :bar, :foo
.
Effectively doing:
class RenameWebHooksBuildEventsToJobEvents < ActiveRecord::Migration
include Gitlab::Database::MigrationHelpers
DOWNTIME = false
disable_ddl_transaction!
def up
rename_column_concurrently :web_hooks, :build_events, :job_events
end
def down
if build_events_was_renamed_to_job_events?
cleanup_concurrent_column_rename :web_hooks, :build_events, :job_events
else
cleanup_concurrent_column_rename :web_hooks, :job_events, :build_events
end
end
end
and
class CleanupRenameWebHooksBuildEventsToJobEvents < ActiveRecord::Migration
include Gitlab::Database::MigrationHelpers
DOWNTIME = false
disable_ddl_transaction!
def up
if build_events_was_renamed_to_job_events?
cleanup_concurrent_column_rename :web_hooks, :build_events, :job_events
else
cleanup_concurrent_column_rename :web_hooks, :job_events, :build_events
end
end
def down
rename_column_concurrently :web_hooks, :job_events, :build_events
end
end
If this is not feasible, then to prevent the rollback job failure from happening again, perhaps we could make it rollback to a post-migration (non-inclusive)?
Discovery
We found this issue in a rollback job failure: https://gitlab.com/gitlab-org/gitlab-ee/-/jobs/29478413 Slack discussion following the failure: https://gitlab.slack.com/archives/C0XM5UU6B/p1502917205000146