Migrations can silently fail during deployments
This weekend's ops outage highlighted the fact that deployments can succeed while the associated migrations silently fail: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2555.
The TL;DR; of that incident is that a migration called for a Postgres extension to be installed, which the deployment user did not have permission to do. This caused migrations to fail for over a month before it caused a big enough problem that we had to go in and manually install the extension and re-run the migrations.
Ideally, a migration failure like this should cause the deployment to fail so that it can be investigated. If that's not desirable, there should at least be a loud notification to somebody that there is a failure.
issue