
Many post deployment migrations failed recently

Context

Post-deployment migrations (PDMs) are run almost daily by release managers, at whatever time suits them.

PDM failures have become more frequent lately: there have already been 6 in the last month. Many of them happened when a migration tried to lock a table or add a new constraint to a column and then failed. Failures of this kind have a negative impact on:

  • Release managers:
    • Block deployments
    • Spend time debugging
    • Spend time waiting for the job run
    • Have to find the migration's owners, who may not be available when the migration runs
    • Cannot run PDMs flexibly, only during low-traffic periods, which are normally limited to late APAC and early EMEA hours (also because of engineer availability)
  • Database operations:
    • The failures hint at database overload or an underlying database issue
  • Customers:
    • Risk of running into the same issue as GitLab.com

This issue was created to keep track of PDM failures and to discuss them within the Release&Deploy team and with other teams (DBO, etc.) in order to find ways to improve the situation. Improvements could go in, but are not limited to, the following directions:

  • Improve DB performance
  • Consult engineering teams on making PDMs safer and faster
  • Reconsider our PDM tooling/strategy, e.g. introduce a way to perform one specific migration at a time
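
To make the last two directions concrete for discussion, here is a minimal sketch, not a description of our current tooling. It assumes each migration can be invoked with a lock timeout and raises an error when the lock is not acquired in time. All names (`LockTimeout`, `run_with_lock_retries`, `run_pending`) are hypothetical. The idea is to retry lock acquisition with short, increasing timeouts instead of waiting indefinitely, and to run migrations one at a time so that a failure stops the run and names the migration that failed.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence, Tuple


class LockTimeout(Exception):
    """Hypothetical error: the lock was not acquired within the timeout."""


def run_with_lock_retries(
    migrate: Callable[[float], None],
    lock_timeouts: Sequence[float] = (0.1, 0.5, 1.0),
    sleep_between: float = 0.0,
) -> int:
    """Call migrate(lock_timeout), retrying with the next timeout on LockTimeout.

    Returns the number of attempts used. Re-raises LockTimeout if the
    last attempt also fails.
    """
    if not lock_timeouts:
        raise ValueError("lock_timeouts must not be empty")
    last = len(lock_timeouts)
    for attempt, timeout in enumerate(lock_timeouts, start=1):
        try:
            migrate(timeout)
            return attempt
        except LockTimeout:
            if attempt == last:
                raise
            # Back off so blocking queries get a chance to finish.
            time.sleep(sleep_between)
    raise AssertionError("unreachable")


def run_pending(
    migrations: Dict[str, Callable[[float], None]],
) -> Tuple[List[str], Optional[str]]:
    """Run migrations one at a time, in insertion order.

    Stops at the first migration that exhausts its retries and returns
    (names that completed, name that failed or None).
    """
    done: List[str] = []
    for name, migrate in migrations.items():
        try:
            run_with_lock_retries(migrate)
        except LockTimeout:
            return done, name
        done.append(name)
    return done, None
```

With a runner shaped like this, a release manager would see exactly which migration failed and which ones completed before it, rather than a failed batch. The timeout values are placeholders and would need to be agreed with DBO.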
Edited by Dat Tang