Improve reversibility of rename_column_concurrently and cleanup_concurrent_column_rename

Problem 1

Our method for changing column type is reversible as a migration+post-migration pair, but the migration by itself is irreversible (going forward then back) and similarly, the post-migration is irreversible backwards (going back then forward). This makes our deploys more fragile.

I.e. If we are changing a column's type, but another unrelated migration dies before the post-migrations are run, and we attempt to rollback, we will get an error like this.

Problem 2

We test rollbacks in CE and EE with this section in gitlab-ci.yml:

.db-rollback: &db-rollback
  <<: *dedicated-runner
  <<: *except-docs
  <<: *pull-cache
  stage: test
  script:
    - bundle exec rake db:rollback STEP=120
    - bundle exec rake db:migrate

It happened today that this job rolled back 120 steps, including this down method in a post-migration:

rename_column_concurrently :web_hooks, :job_events, :build_events

But 120 steps was not far enough to run the down method of the corresponding migration:

cleanup_concurrent_column_rename :web_hooks, :job_events, :build_events

So during bundle exec rake db:migrate, it got the error:

bundle exec rake db:migrate
rake aborted!
StandardError: An error has occurred, all later migrations canceled:

PG::UndefinedObject: ERROR:  trigger "trigger_688beaaec90d" for table "web_hooks" does not exist
: DROP TRIGGER trigger_688beaaec90d ON web_hooks

The error occurred during the up method of the post-migration:

cleanup_concurrent_column_rename :web_hooks, :build_events, :job_events

As you can see, build_events and job_events are flipped, since the up method of this post-migration was never intended to cleanup its own down method.

So the db-rollback job can fail just because it happened to rollback to in-between a migration and its corresponding post-migration.

Solutions

It would be ideal if we could make each migration or post-migration individually reversible. For example, cleanup_concurrent_column_rename could check whether it needs to clean up the result of rename_column_concurrently :table, :foo, :bar, or rename_column_concurrently :table, :bar, :foo.

Effectively doing:

class RenameWebHooksBuildEventsToJobEvents < ActiveRecord::Migration
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false

  disable_ddl_transaction!

  def up
    rename_column_concurrently :web_hooks, :build_events, :job_events
  end

  def down
    if build_events_was_renamed_to_job_events?
      cleanup_concurrent_column_rename :web_hooks, :build_events, :job_events
    else
      cleanup_concurrent_column_rename :web_hooks, :job_events, :build_events
    end
  end
end

and

class CleanupRenameWebHooksBuildEventsToJobEvents < ActiveRecord::Migration
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false

  disable_ddl_transaction!

  def up
    if build_events_was_renamed_to_job_events?
      cleanup_concurrent_column_rename :web_hooks, :build_events, :job_events
    else
      cleanup_concurrent_column_rename :web_hooks, :job_events, :build_events
    end
  end

  def down
    rename_column_concurrently :web_hooks, :job_events, :build_events
  end
end

If this is not feasible, then to prevent the rollback job failure from happening again, perhaps we could make it rollback to a post-migration (non-inclusive)?

Discovery

We found this issue in a rollback job failure: https://gitlab.com/gitlab-org/gitlab-ee/-/jobs/29478413 Slack discussion following the failure: https://gitlab.slack.com/archives/C0XM5UU6B/p1502917205000146

Edited Aug 16, 2017 by Michael Kozono
Assignee Loading
Time tracking Loading