Make the execution of the post-deploy migration pipeline smarter.
## Problem statement
The post-deploy migration (PDM) pipeline was introduced in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/585 as a way to make all auto-deploy packages rollbackable by removing post-deploy migrations from the auto-deploys. Release managers execute the PDM pipeline manually, on a daily basis and at their discretion.
Executing post-deployment migrations independently from the coordinated pipeline reduces deploy timings and increases the number of packages suitable for rollback, but it has some downsides:
* Post-deploy migrations contribute to database load. Because they are usually time-consuming to execute, running them during a high-traffic period adds database overhead.
* The PDM pipeline blocks deployment pipelines. To prevent a package upgrade while post-deploy migrations are running, the PDM pipeline locks the respective environment during its execution, which prevents deployments.
* The PDM sets the line for rollbacks, so executing it at the wrong time could prevent rollbacks.
* Manually executing the PDM adds cognitive load to release managers, who need to remember to run it.
See https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1978 for a previous discussion.
## Proposal
Express the nature of each post-deploy migration and, based on this information, execute it at a suitable time. For example:
1. A data migration that only schedules Sidekiq jobs (no locking problem): Run during the week
1. A very large index or foreign key creation: Run during the weekend (or any other window with low traffic and no concurrent deploys)
1. DDL cleanup: Weekend or even longer lived batches
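
To make the examples above concrete, here is a minimal sketch of how a migration's nature could map to an execution window. The category names, the `can_run_at` helper, and the weekday/weekend policy are hypothetical illustrations of this proposal, not an existing GitLab or release-tools API. The real classification is still to be defined in Stage 1.

```python
from datetime import datetime
from enum import Enum


class MigrationNature(Enum):
    # Hypothetical categories, taken from the three examples in this proposal.
    DATA_SCHEDULING = "data_scheduling"  # only schedules Sidekiq jobs
    LARGE_DDL = "large_ddl"              # very large index or foreign key creation
    DDL_CLEANUP = "ddl_cleanup"          # DDL cleanup


# Assumed policy: natures that should wait for a low-traffic window.
WEEKEND_ONLY = {MigrationNature.LARGE_DDL, MigrationNature.DDL_CLEANUP}


def is_weekend(moment: datetime) -> bool:
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return moment.weekday() >= 5


def can_run_at(nature: MigrationNature, moment: datetime) -> bool:
    """Return True if a migration of this nature may run at `moment`."""
    if nature in WEEKEND_ONLY:
        return is_weekend(moment)
    # Data migrations that only schedule jobs run during the week.
    return not is_weekend(moment)
```

For instance, under these assumptions `can_run_at(MigrationNature.LARGE_DDL, datetime(2024, 1, 3))` is false because 3 January 2024 is a Wednesday, while the same call with `datetime(2024, 1, 6)` (a Saturday) is true.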
## DRI
TBD
## Exit criteria
* Post-deploy migrations are classified based on their nature.
* PDM is executed automatically at specific times based on the nature of the post-deploy migrations.
* PDM considers deployments and rollback availability for its execution.
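
As a rough sketch of how these criteria could combine into a single automated decision, the snippet below gates a PDM run on three signals. The `EnvironmentState` fields and the `should_run_pdm` function are assumptions made for illustration. How each signal would actually be obtained is not specified in this epic.

```python
from dataclasses import dataclass


@dataclass
class EnvironmentState:
    # All three fields are hypothetical signals, not existing tooling outputs.
    in_suitable_window: bool      # the migrations' nature allows running now
    deployment_in_progress: bool  # a deployment pipeline is currently running
    rollback_may_be_needed: bool  # running now could block a needed rollback


def should_run_pdm(state: EnvironmentState) -> bool:
    """Run the PDM only when the timing is right and nothing would be blocked."""
    return (
        state.in_suitable_window
        and not state.deployment_in_progress
        and not state.rollback_may_be_needed
    )
```

Keeping the decision in one pure function would make the policy easy to test and to adjust as the classification evolves.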
## Plan
- [ ] Stage 1: Classify post-deploy migrations based on their nature
- [ ] Stage 2: Execute the PDM pipeline based on the post-deploy migration nature
- [ ] Stage 3: Rollout and final steps
## Issues
* [#2136 - Post Deploy migrations safety indicator rollback](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2136)
* [Classify post-deploy migrations based on their nature](https://gitlab.com/gitlab-org/gitlab/-/issues/346604)
## Label admin
Use the following labels when creating issues for this epic:
```
/label ~"post-deploy migrations::phase 2" ~AutoDeploy ~"team::Delivery" ~"Delivery::P4"
/epic &778
```
## Follow ups
<details><summary>Details</summary>
**From the first iteration:**
* [#2490 - Move the detection of pending post migrations to the main stage](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2490)
* [#587 - Annotate pre and post-deployment migrations in Grafana](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/587)
* [#2417 - Expand the incident template to include the execution of post-migrations](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2417)
**Nice to have**
* [#2416 - Add the post-migration diff in the Slack threads](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2416)
* [#2334 - Build a Prometheus alert based on the pending post-deploy pipelines](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2334)
* [#2357 - Schedule post-deployment migrations as a job on k8s](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2357)
* [#2498 - Number of pending migrations on release manager dashboard considers main and ci databases](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2498)
* [#2504 - Proposal: Have a separate set of scoped labels for post-deployment migrations](https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/2504)
</details>