Make the execution of the post-deploy migration pipeline smarter.
## Problem statement

The post-deploy migration (PDM) pipeline was introduced in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/585 as a way to make all auto-deploy packages rollbackable by removing post-deploy migrations from the auto-deploys. The PDM pipeline is executed manually on a daily basis at the discretion of release managers.

Executing post-deploy migrations independently from the coordinated pipeline reduces deploy timings and increases the number of packages suitable for rollback, but it has some downsides:

* Post-deploy migrations contribute to database load. Because they are usually time-consuming, executing them during a high-traffic period adds database overhead.
* The PDM pipeline blocks deployment pipelines. To prevent a package upgrade while post-deploy migrations are running, the PDM pipeline locks the respective environment during its execution, preventing deployments.
* The PDM sets the line for rollbacks; executing it at the wrong time could prevent rollbacks.
* Manually executing the PDM adds cognitive load for release managers, since it is a task they need to remember to execute.

See https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1978 for a previous discussion.

## Proposal

Express the nature of each post-deploy migration and, based on this information, execute it at a suitable time. For example:

1. A data migration that only schedules Sidekiq jobs (no locking problem): run during the week.
1. A very large index or foreign key creation: run during the weekend (or any other low-traffic window with no concurrent deploys).
1. DDL cleanup: run during the weekend, or in even longer-lived batches.

## DRI

TBD

## Exit criteria

* Post-deploy migrations are classified based on their nature.
* PDM is executed automatically at specific times based on the nature of the post-deploy migrations.
* PDM considers deployments and rollback availability for its execution.
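
To make the proposal more concrete, the sketch below shows one way the classification could drive scheduling. It is illustrative only: the `Nature` categories, the `can_run` and `runnable` helpers, and the migration names are hypothetical, and the "weekend" window is assumed to be Saturday and Sunday, whereas a real low-traffic window would come from traffic data. It also assumes that migrations marked as safe during the week may run on any day.

```python
from __future__ import annotations

from datetime import datetime
from enum import Enum


class Nature(Enum):
    """Hypothetical categories mirroring the three examples in the proposal."""

    DATA_SCHEDULING = "data_scheduling"  # only schedules Sidekiq jobs
    HEAVY_DDL = "heavy_ddl"              # very large index or foreign key creation
    DDL_CLEANUP = "ddl_cleanup"          # DDL cleanup


# Assumption: only job-scheduling data migrations are safe on weekdays;
# every other nature waits for the weekend window.
WEEKDAY_SAFE = {Nature.DATA_SCHEDULING}


def is_weekend(moment: datetime) -> bool:
    # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    return moment.weekday() >= 5


def can_run(nature: Nature, moment: datetime, deploy_in_progress: bool = False) -> bool:
    """Return True if a migration of this nature may run at `moment`."""
    if deploy_in_progress:
        # The PDM locks the environment, so do not overlap a deployment.
        return False
    return nature in WEEKDAY_SAFE or is_weekend(moment)


def runnable(
    pending: dict[str, Nature],
    moment: datetime,
    deploy_in_progress: bool = False,
) -> list[str]:
    """Filter pending migrations down to those allowed to run at `moment`."""
    return [
        name
        for name, nature in pending.items()
        if can_run(nature, moment, deploy_in_progress)
    ]


if __name__ == "__main__":
    pending = {
        "schedule_backfill_jobs": Nature.DATA_SCHEDULING,  # hypothetical names
        "add_large_index": Nature.HEAVY_DDL,
        "drop_unused_column": Nature.DDL_CLEANUP,
    }
    wednesday = datetime(2024, 1, 3, 12, 0)
    print(runnable(pending, wednesday))
```

A rule table like this keeps the scheduling decision separate from the classification itself, so the windows can be tuned without relabeling migrations. Rollback availability, which the exit criteria also mention, is not modeled here.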

## Plan

- [ ] Stage 1: Classify post-deploy migrations based on their nature
- [ ] Stage 2: Execute the PDM pipeline based on the post-deploy migration nature
- [ ] Stage 3: Rollout and final steps

## Issues

* [#2136 - Post Deploy migrations safety indicator rollback](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2136)
* [Classify post-deploy migrations based on their nature](https://gitlab.com/gitlab-org/gitlab/-/issues/346604)

## Label admin

Use the following labels when creating issues for this epic:

```
/label ~"post-deploy migrations::phase 2" ~AutoDeploy ~"team::Delivery" ~"Delivery::P4"
/epic &778
```

## Follow ups

<details><summary>Details</summary>

**From the first iteration:**

* [#2490 - Move the detection of pending post migrations to the main stage](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2490)
* [#587 - Annotate pre and post-deployment migrations in Grafana](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/587)
* [#2417 - Expand the incident template to include the execution of post-migrations](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2417)

**Nice to have:**

* [#2416 - Add the post-migration diff in the Slack threads](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2416)
* [#2334 - Build a Prometheus alert based on the pending post-deploy pipelines](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2334)
* [#2357 - Schedule post-deployment migrations as a job on k8s](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2357)
* [#2498 - Number of pending migrations on release manager dashboard considers main and ci databases](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2498)
* [#2504 - Proposal: Have a separate set of scoped labels for post-deployment migrations](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2504)

</details>