Zero-downtime migrations
We want to do zero-downtime deploys on GitLab.com. For that to happen, we need to have zero-downtime migrations (this is necessary but not sufficient).
For a typical SaaS product, a zero-downtime migration can look something like this, with three deploys:
- Change code to not depend on old column (possibly removing it from the AR column cache), or to handle both old and new data formats.
- Remove column, migrate data, etc.
- Change code to remove the workarounds added in the first deploy.
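As a rough illustration of the first deploy, here is a minimal sketch of code that handles both old and new data formats. The field names (`name`, `first_name`, `last_name`) and the `display_name` helper are made up for this example, and plain hashes stand in for database rows so it runs without Rails:

```ruby
# Hypothetical example: a single "name" field is being split into
# "first_name" and "last_name". During the first deploy the code must
# read rows in either format, because the data migration (second deploy)
# has not run yet.
def display_name(row)
  if row.key?(:first_name) || row.key?(:last_name)
    # New format: two separate fields; skip whichever one is missing.
    [row[:first_name], row[:last_name]].compact.join(' ')
  else
    # Old format: one combined field.
    row[:name].to_s
  end
end

old_row = { name: 'Ada Lovelace' }
new_row = { first_name: 'Ada', last_name: 'Lovelace' }

puts display_name(old_row)
puts display_name(new_row)
```

Once every row has been migrated, the third deploy would delete the old-format branch.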
I've mentioned those two cases because they are the biggest ones I'm aware of: either we're removing a column, or we're migrating data from one format (or location) to another.
So for a SaaS, this is some work, but not a huge amount of work, and it has a really nice end result.
For a product that is both SaaS and has a package, like ours, this is different. We can provide packages that, when deployed in a specific sequence, don't have downtime. But we can't (easily?) provide packages that can be upgraded from an arbitrary version to an arbitrary version without downtime. So, my proposal is something like this:
- We come up with a versioning scheme under which upgrading through the releases in order never requires a downtime migration. This might be as simple as the .0, .1, and .2 releases of any monthly release, or we might use three consecutive monthly releases, or we might even create a special release numbering scheme, like 8.x.0-zd0, 8.x.0-zd1, 8.x.0-zd2. I don't really like the last one, but I'm open to suggestions.
- We write all of our migrations to match this model, and we have a special staging environment that can be used to test these. RCs may have a downtime migration by mistake, but any other release should not.
- We communicate clearly (in release posts and elsewhere) that the way to get zero-downtime deploys is this, and if you don't upgrade this way, you may need to take downtime. We can either use the existing
DOWNTIME = true
statement in the migrations, or try for something more granular.
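To make the "more granular" idea a bit more concrete, here is a minimal sketch of how a release check could report which migrations need downtime and why. Everything apart from the `DOWNTIME` constant is an assumption for illustration: the `DOWNTIME_REASON` constant, the `downtime_report` helper, and the three example classes are hypothetical, and the classes are plain Ruby rather than real `ActiveRecord::Migration` subclasses so the sketch runs on its own:

```ruby
# Hypothetical stand-ins for migration classes. Real migrations would
# inherit from ActiveRecord::Migration; that is omitted here so the
# sketch runs without Rails.
class AddIndexToUsers
  DOWNTIME = false
end

class RemoveLegacyColumn
  DOWNTIME = true
  # Assumed extra constant, not an existing convention in this proposal.
  DOWNTIME_REASON = 'Drops a column that older code still reads'
end

class BackfillNewFormat
  # No DOWNTIME constant declared at all.
end

# Returns a hash of migration name => reason for every migration that
# needs downtime. A migration that does not declare DOWNTIME is treated
# as needing downtime, so an unreviewed migration is never silently
# assumed to be safe.
def downtime_report(migrations)
  migrations.each_with_object({}) do |migration, report|
    declared = migration.const_defined?(:DOWNTIME, false)
    next if declared && !migration.const_get(:DOWNTIME)

    report[migration.name] =
      if !declared
        'DOWNTIME not declared'
      elsif migration.const_defined?(:DOWNTIME_REASON, false)
        migration.const_get(:DOWNTIME_REASON)
      else
        'no reason given'
      end
  end
end

report = downtime_report([AddIndexToUsers, RemoveLegacyColumn, BackfillNewFormat])
report.each { |name, reason| puts "#{name}: #{reason}" }
```

A check like this could run against the migrations added between two releases, and fail the build for any release other than an RC if the report is not empty.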
Disclaimer: I'm not an expert on any of this; this is just how I see it happening. I'm totally open to other suggestions and ideas.