Many post-deployment migrations have failed recently

Context

Post-deployment migrations (PDMs) are run almost daily by release managers, at flexible times.

Lately, PDM failures have become more frequent: in the last month alone, there have already been 6. Many of them fall into the same category: the migration tried to lock a table or add a new constraint to some columns, and then failed. These failures have a negative impact on:

Release managers:
- Block deployments
- Spend time debugging
- Spend time waiting for job runs
- Have to find the migration's owners, who may not be available at the time of the run
- Cannot run PDMs flexibly, only during low-traffic periods, which are normally limited to late APAC and early EMEA hours (also because of engineer availability)
Database operations:
- The failures may hint at overload or a database issue
Customers:
- Risk of running into the same issues as GitLab.com

This issue was created to keep track of PDM failures and to discuss, within the Release&Deploy team as well as with other teams (DBO, etc.), how to improve the situation. The improvements can go in (but are not limited to) the following directions:

- Improve DB performance
- Consult engineering teams to make PDMs safer and faster
- Reconsider our PDM tooling/strategy, e.g. introduce a way to perform one specific migration at a time
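
To make the last direction more concrete, here is a minimal sketch of what "perform one specific migration at a time" could look like. It is not our existing tooling: `run_single_migration`, `LockTimeout`, `Result`, and the version strings are all hypothetical names used for illustration only. The sketch assumes a lock timeout surfaces as a distinct error, so that only that kind of failure is retried.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


class LockTimeout(Exception):
    """Hypothetical error: the migration could not acquire its lock in time."""


@dataclass
class Result:
    version: str
    succeeded: bool
    attempts: int
    error: str = ""


def run_single_migration(
    migrations: Dict[str, Callable[[], None]],
    version: str,
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Result:
    """Run exactly one pending migration, retrying only on lock timeouts."""
    if version not in migrations:
        raise KeyError(f"unknown migration version: {version}")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    migrate = migrations[version]
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            migrate()
            return Result(version, True, attempt)
        except LockTimeout as exc:
            last_error = str(exc)
            if attempt < max_attempts:
                # Exponential backoff: 1x, 2x, 4x ... the base delay.
                sleep(base_delay_s * 2 ** (attempt - 1))
    # All attempts hit a lock timeout; report instead of raising so the
    # caller can decide whether to move on to another migration.
    return Result(version, False, max_attempts, last_error)
```

With something along these lines, a release manager could run one chosen PDM and postpone one that keeps timing out, without it blocking the other pending migrations. Whether the other migrations can safely proceed depends on ordering constraints between migrations, which this sketch does not model.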

Edited Sep 18, 2025 by Dat Tang