Tooling to prevent long-running migrations
Today we had a production deployment running a migration for 1 hour and 13 minutes. It was replacing an index on `ci_builds`, which is a huge table.
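For context, an index replacement on a table this size typically looks something like the sketch below. This is a hypothetical reconstruction using GitLab's `add_concurrent_index` / `remove_concurrent_index_by_name` migration helpers; the actual columns and index names involved are not known here.

```ruby
# frozen_string_literal: true

class ReplaceCiBuildsIndex < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  # Concurrent index operations cannot run inside a transaction.
  disable_ddl_transaction!

  def up
    # Each statement scans the whole table; on a table the size of
    # ci_builds this can easily take more than an hour, during which
    # the deployment pipeline is blocked waiting on the migration job.
    add_concurrent_index :ci_builds, [:project_id, :status],
      name: 'index_ci_builds_on_project_id_and_status'
    remove_concurrent_index_by_name :ci_builds, 'index_ci_builds_on_project_id'
  end

  def down
    add_concurrent_index :ci_builds, :project_id,
      name: 'index_ci_builds_on_project_id'
    remove_concurrent_index_by_name :ci_builds,
      'index_ci_builds_on_project_id_and_status'
  end
end
```

Note that even though `add_concurrent_index` avoids blocking writes, the migration itself still runs until the index build finishes, which is what held up the deployment.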
Release managers started investigating the stuck deployment and eventually traced it to this suspicious migration.
How can we make sure something like this is surfaced at the review level?
Maybe we could add a RuboCop check for migrations that touch problematic (very large) tables.
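A minimal sketch of what such a cop could look like, following the style of the existing cops under `rubocop/cop/migration` (the cop name, the table list, and the offense message are all illustrative):

```ruby
# frozen_string_literal: true

module RuboCop
  module Cop
    module Migration
      # Hypothetical cop: flags schema changes against tables known to
      # be very large on GitLab.com, so the reviewer has to consider
      # the migration's runtime explicitly.
      class LargeTableOperation < RuboCop::Cop::Cop
        LARGE_TABLES = %i[ci_builds events merge_requests notes projects users].freeze

        MIGRATION_METHODS = %i[
          add_column remove_column change_column change_column_default
          add_index remove_index add_concurrent_index
          remove_concurrent_index remove_concurrent_index_by_name
          update_column_in_batches
        ].freeze

        MSG = '%s is a very large table. Time this migration on ' \
              '#database_lab and consider a post-deployment or ' \
              'background migration instead.'

        def on_send(node)
          _receiver, method_name, *args = *node
          return unless MIGRATION_METHODS.include?(method_name)

          args.each do |arg|
            next unless arg.sym_type? || arg.str_type?

            table = arg.children.first.to_sym
            add_offense(node, message: format(MSG, table)) if LARGE_TABLES.include?(table)
          end
        end
      end
    end
  end
end
```

Since the cop would run in the standard RuboCop CI job, this surfaces the problem exactly at the review level.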
We could also require the MR author to time the migration code on #database_lab before merging.
How long can a migration run without causing an outage?
The migrations job has a 5-hour timeout; it installs the new GitLab package and runs all the migrations.
I think we should make sure regular migrations are fast. A migration that runs for an hour could probably have been shipped as a post-deployment migration or a background migration instead.
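As a rough sketch of the background-migration route, assuming GitLab's `queue_background_migration_jobs_by_range_at_intervals` helper is available (the job class name, interval, and batch size below are made up; the post-deployment alternative would simply be the same migration class placed in `db/post_migrate`, so it runs after the deployment has completed):

```ruby
# frozen_string_literal: true

class ScheduleSomethingOnCiBuilds < ActiveRecord::Migration[6.0]
  include Gitlab::Database::MigrationHelpers

  MIGRATION = 'DoSomethingOnCiBuilds' # hypothetical background migration job
  DELAY_INTERVAL = 2.minutes
  BATCH_SIZE = 10_000

  disable_ddl_transaction!

  class Build < ActiveRecord::Base
    include EachBatch

    self.table_name = 'ci_builds'
  end

  def up
    # Enqueues one Sidekiq job per BATCH_SIZE rows, spaced DELAY_INTERVAL
    # apart, so no single statement runs for an hour and the deployment
    # itself finishes quickly.
    queue_background_migration_jobs_by_range_at_intervals(
      Build, MIGRATION, DELAY_INTERVAL, batch_size: BATCH_SIZE
    )
  end

  def down
    # No-op: scheduled background jobs are not rolled back automatically.
  end
end
```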
If we decide that there is no other option than merging a long-running migration, how can we inform release managers?
If this is the only option, then I propose that a release manager should be the one to merge it. They are in control of the deployment process, and are best placed to assess when such a migration should run.
/cc @gitlab-org/delivery @gitlab-org/database-team @ssarka @sabrams @tigerwnz