Decide on the future of post-deployment migrations
As part of our latest OKRs, I’ve been thinking about how to remove the blocking nature of post-deployment migrations. Currently, there are two main problems with post-migrations:
- Post-migrations block rollbacks. They can’t be rolled back due to the nature of their operations, so when a package includes a post-migration and an incident needs to be fixed, we need to either roll forward or perform a hot-patch. Both options have disadvantages: preparing and deploying a merge request can take up to 6h, and hot-patching blocks auto-deploy processes. For reference, rolling back a package takes only ~1h.
- Post-migrations are lengthy. The post-migration job has a timeout of 10 hours to fit all the possible operations a post-migration can execute. During this time the database can be impacted: on-call SREs are often paged about long-running transactions, and other performance-related issues can appear.
The purpose of this issue is to offer a big picture of where we’re at with post-migrations, and what we can do about them.
Status of post-deployment migrations
What are they used for?
Post-deployment migrations can be, and are, used for these specific actions:
- For adding non-critical indexes that take a long time to create, e.g. indexes on high-traffic tables.
- For executing data migrations inline that take at most a few minutes.
- For scheduling background migrations.
- For clean-ups, e.g. removing unused columns.
Note that none of these activities depend on the application code that was deployed. Data and background migrations are required to completely isolate their code from the application code.
How are post-migrations executed?
Migrations, regular and post, are part of the deployer pipeline and run on staging and production. Regular migrations run before the fleet is updated, whereas post-migrations run after the fleet is updated.
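The ordering can be sketched as follows. This is only an illustration: the function names and step labels are hypothetical and do not correspond to actual deployer job names.

```python
from typing import List


def deployment_steps(environment: str, has_post_migrations: bool) -> List[str]:
    """Return the ordered steps of a deployment to one environment.

    Step labels are illustrative placeholders, not real deployer jobs.
    """
    steps = [
        f"{environment}: run regular migrations",  # before the fleet is updated
        f"{environment}: update fleet",
    ]
    if has_post_migrations:
        # Post-migrations only run once the fleet is on the new code.
        steps.append(f"{environment}: run post-deployment migrations")
    return steps


def can_roll_back(has_post_migrations: bool) -> bool:
    """A package that includes a post-migration is not suitable for rollback."""
    return not has_post_migrations
```

For example, `deployment_steps("production", True)` lists the post-migration step last, and `can_roll_back(True)` is `False` for that same package.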
How often do we have post-migrations?
Speaking from experience, I could say that every day we have at least one package with a post-migration. I don’t have the data to back this up, but it can be gathered to corroborate this assumption.
Do we need to remove post-migrations from auto-deployments?
I’d say yes for the following reasons:
- Post-migrations prevent rollbacks, which directly impacts our ability to mitigate production incidents. For example, production#5340 (closed) was an incident that we had to hot-patch because the rollback option wasn’t available.
- Post-migrations are time-consuming. Adding indexes or scheduling background migrations can take considerable time, which has a direct impact on MTTP.
- Post-migrations are also often involved in incidents due to their length or the migrations that were executed. Some recent examples: production#5339 (closed), production#5256 (closed).
What is stopping us from extracting post-deployment migrations from the auto-deployments?
Some reasons:
- There are no alternatives for these migrations:
  - Post-migrations can’t be executed as regular migrations since they’re not critical to the application code. Treating them as regular migrations would also increase the time of the <env>-migration job, which is not ideal.
  - Transferring the post-migrations to release-tools is also not ideal since that moves the problem to another part of the coordinated pipeline.
- From the engineering side, there’s uncertainty about timings. We have tooling to indicate when an MR was deployed to production, but we don’t have any tooling to report when a background migration finished or when a large index was created. As a result, it’s difficult for engineers to know when their post-migration was executed, and they often ask in Slack channels about its status. Moving post-migrations outside the deployer pipeline without any tooling could increase this uncertainty.
- We need a way to execute post-migrations at the same pace/speed they’re executed now to avoid interfering with Development Velocity.
If we want to remove post-migrations from the auto-deployments, what options do we have?
Note: This section is updated based on the analysis made in this issue
- Option A: Execute post-migrations independently from the coordinated pipeline
- Option B: Post-migrations are executed only in specific coordinated pipelines
- Option C: Post-migrations are executed at the end of the week.
- Option D: Execute post-migrations in a specific schedule based on their nature.
- Option E (option selected): Removal of post-deployment migrations from the deployment process
Option A: Execute post-migrations independently from the coordinated pipeline
We could trigger a daily pipeline schedule, or another fancier mechanism, that searches for pending post-migrations and executes them on a specific schedule.
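A minimal sketch of what such a scheduled job could check, under stated assumptions: migration files follow the usual Rails `<timestamp>_<description>.rb` naming, the set of already executed versions is available to the job, and the job is skipped while a deployment is ongoing. All function names here are hypothetical and are not part of any existing tooling.

```python
import re
from typing import Iterable, List, Set

# Assumption: files are named "<timestamp>_<description>.rb" (Rails convention).
MIGRATION_FILE = re.compile(r"^(\d+)_\w+\.rb$")


def pending_post_migrations(files: Iterable[str], executed_versions: Set[str]) -> List[str]:
    """Return post-migration files that have not been executed yet, oldest first."""
    pending = []
    for name in files:
        match = MIGRATION_FILE.match(name)
        if match and match.group(1) not in executed_versions:
            pending.append((int(match.group(1)), name))
    return [name for _, name in sorted(pending)]


def should_run_scheduled_job(pending: List[str], environment_locked: bool) -> bool:
    """Run only when there is work to do and no deployment is in progress."""
    return bool(pending) and not environment_locked


files = [
    "20220102000000_add_index_to_projects.rb",
    "20220101000000_cleanup_old_column.rb",
    "README.md",  # ignored: not a migration file
]
pending = pending_post_migrations(files, executed_versions={"20220101000000"})
```

Here `pending` contains only `20220102000000_add_index_to_projects.rb`, and `should_run_scheduled_job(pending, environment_locked=True)` returns `False`, so the run would wait for the next schedule.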
Pros:
- Post-migrations will be dissociated from the coordinated pipeline, making all packages suitable for rollback if needed.
- Development velocity is not affected, the process for creating and merging an MR with a post-migration stays the same.
- Deployer pipeline is simplified, one job is removed from staging and production.
- Deployment duration is reduced in general, decreasing the MTTP and improving the Deployment SLO apdex.
Cons:
- We’d need to adjust our tooling to account for this and define the times to execute the post-migration job.
- For simplicity, we should execute the post-migration job when the gprd environment is unlocked (no ongoing deployment), which can be complex to coordinate.
- The duration and content of post-migrations will be the same, so we would still have incidents.
- Without the ability to know if a post-migration was executed, there could be some complexity around managing the dependency and follow-up code around them.
- We would need to ensure all the required post-migrations were executed in GitLab.com before including them in a self-managed release.
Option B: Post-migrations are executed only in specific coordinated pipelines
We could execute post-migrations only in specific coordinated pipelines, probably the ones created in APAC timezones.
Pros:
- Post-migration execution is limited to certain packages. All others would be suitable to rollback.
- Development velocity is not affected, the process for creating and merging an MR with a post-migration stays the same.
- Complexity around including post-deployment migrations into self-managed releases is reduced compared to Option A.
Cons:
- Post-migrations are still associated with the coordinated pipeline, which prevents rollbacks when required, increases the MTTP, and worsens the Deployment SLO apdex.
- Executing them in APAC could help due to the low organic traffic this timezone has, but it might not scale in the future.
- We also have fewer people in this timezone so it could be harder to resolve incidents.
- This implies having some deployments with longer execution times, which is not desirable.
Option C: Post-migrations are executed at the end of the week.
We pile up post-migrations for a period (e.g. a week) and execute them in batch.
Pros:
- Post-migration execution is limited to once a week, all other packages are suitable for rollbacks.
Cons:
- This affects development velocity since delaying post-migrations could block features and engineering teams.
- Batching migrations into a single deployment per week will make that deployment risky, as it could lead to unnecessary DB pressure and/or production incidents.
- Batching changes goes against our goals to adopt continuous delivery practices.
Option D: Execute post-migrations in a specific schedule based on their nature.
Originally proposed on #1978 (comment 676080733)
We could classify post-migrations based on their nature and use this information to choose an appropriate time slot for their execution. Some examples (based on #1978 (comment 676080733)):
- Large indexes or long foreign key creations => Run during low-traffic or no concurrent deploys window time
- Non-critical indexes and data migrations => Run daily
- DDL cleanups => Run on the weekend or even longer-lived batches (e.g at the end of each release)
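The classification above could be expressed as a simple lookup. This is a hedged sketch: the category names, the `SCHEDULE` table, and the `plan` helper are hypothetical and only mirror the three examples listed, not an agreed taxonomy.

```python
from enum import Enum
from typing import Dict, List


class PostMigrationKind(Enum):
    LARGE_INDEX_OR_FOREIGN_KEY = "large index or long foreign key creation"
    NON_CRITICAL_INDEX_OR_DATA = "non-critical index or data migration"
    DDL_CLEANUP = "DDL cleanup"


# Time slot per kind, taken from the examples in the list above.
SCHEDULE: Dict[PostMigrationKind, str] = {
    PostMigrationKind.LARGE_INDEX_OR_FOREIGN_KEY: "low-traffic or no-concurrent-deploys window",
    PostMigrationKind.NON_CRITICAL_INDEX_OR_DATA: "daily",
    PostMigrationKind.DDL_CLEANUP: "weekend or longer-lived batch",
}


def plan(migrations: Dict[str, PostMigrationKind]) -> Dict[str, List[str]]:
    """Group migration names by the time slot their kind maps to."""
    slots: Dict[str, List[str]] = {}
    for name, kind in migrations.items():
        slots.setdefault(SCHEDULE[kind], []).append(name)
    return slots


grouped = plan({
    "add_index_to_large_table": PostMigrationKind.LARGE_INDEX_OR_FOREIGN_KEY,
    "backfill_small_column": PostMigrationKind.NON_CRITICAL_INDEX_OR_DATA,
    "drop_unused_column": PostMigrationKind.DDL_CLEANUP,
})
```

With a table like this, the developer only has to declare the kind of a post-migration, and the slot follows from it.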
Pros
- Classifying post-migrations based on their nature allows us to:
- Exert control over post-migrations and the operations they execute.
- Determine which post-migrations are safe to roll back: e.g. adding/removing columns should be considered unsafe, but adding indexes should be deemed safe.
- Executing them based on their content and on specific schedules removes the uncertainty around post-deploy migrations. Release Managers would have a better idea of what is going to be executed and the estimated execution time.
- Based on their operations, we can select the best schedules to execute them, e.g time-consuming operations can be executed during low-traffic, unblocking deployments to production.
- Benefits from Option A and Option C: Faster MTTP, all packages suitable for rollback, and more.
Cons
- Classification of post-migrations should be done during the development cycle by the developer. This implies a learning curve for developers when it comes to creating and scheduling post-migrations.
- The duration and content of post-migrations will be the same, so we would still have incidents.
- We’d need to ensure all the required post-migrations are executed in GitLab.com before including them in a self-managed release.
- For starters, triggering post-deploy migrations will be a manual task, which adds a cognitive load on Release Managers.
Option E (option selected): Removal of post-deployment migrations from the deployment process
Originally proposed on #1978 (comment 674397821)
Post-deployment migrations are a design that is not supported by the Rails framework. They were originally introduced to be optionally executed after a deployment. Over time, we have come to rely on them to perform operations that are required for the application to operate, which has convoluted our deployment and release processes. In the long term, it’d be great if we could think of a future without post-deploy migrations, replacing them with a feature on GitLab.com and removing them from the deployment process.
Pros
- Embedding post-deploy migrations into the product presents a dogfood opportunity.
- Complexities of the post-deployment migrations will be removed from the deployment process. Deployment and release processes are greatly simplified as a consequence.
- Benefits from Option A and Option C: Faster MTTP, all packages suitable for rollback, and more.
Cons
- Post-deploy migrations have been used for years. Removing this long-standing process implies an extensive learning curve across multiple engineering departments.
- This long-term effort will require the involvement of multiple stakeholders from different departments.
Recommendation
Note: This section was updated based on the discussion made on this issue
From: #1978 (comment 683640203)
Being able to remove post-deployment migrations entirely will depend on us knowing more about the types of migration and making improvements to the product and tooling to support this. This will be a long-term outcome that we can work towards iteratively.
Phase 1: Remove the blocking nature of post-deployment migrations (similar to Option A)
With post-deploy migrations executed independently from the coordinated pipeline, the execution of the post-migrations will be a manual job at the discretion of Release Managers. We’d need to work towards increasing the visibility and the information of post-deploy migrations so Release Managers can use it to make deployment, rollback, and hot-patch decisions, and in general for incident visibility.
Executing post-deployment migrations independently from the coordinated pipeline has the benefit of reducing the deploy timings and increasing the number of packages suitable for rollback at the short-term cost of adding cognitive load to Release Managers.
Phase 2: Make the execution of post-deploy migrations smarter (Option D)
With post-migrations running independently from the coordinated pipeline, the next milestone would be to remove the cognitive load for Release Managers and start automatically executing post-deployment migrations based on their classification.
Phase 3: Remove post-deployment migrations from the deployment process.
At this point, with the information and processes established in the earlier phases, we should’ve reached the level of maturity to consider removing the post-migrations from the deployment process, and replacing them with a GitLab feature.
Previous version
Based on the above, I’d be inclined to adopt Option A: "Execute post-migrations independently from the coordinated pipeline" since it’d allow us to unbind the post-migrations from the auto-deploy process. To do so, we could do the following (very broad) steps:
- Add a self-service command in ChatOps that allows engineers to verify whether a post-migration was executed on GitLab.com.
- Add a pipeline schedule that executes pending post-migrations at least twice a day, once in AMER and once in EMEA.
- Remove the legacy post-migration jobs from the deployer pipeline.