Ensure SREs Can Determine if Deployed Registries Can Be Safely Reverted to a Previous Version
Context
During the incident 2023-05-12: Container registry pulls failing wi... (gitlab-com/gl-infra/production#14260 - closed), there was confusion about whether the registry version could be safely rolled back, because database migrations might already have been executed. We should provide a way for SREs to check this quickly.
From: gitlab-com/gl-infra/production#14263 (comment 1390262977)
Possible Solutions
Major Version Bumps for Non-Rollbackable Migrations
This approach would have the container registry team increment the major version number of the registry each time we determine that it's not possible to revert safely to a previous registry version. It would be a manual process, but would allow a quick reference for the SRE without needing to check the output of tooling.
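As a rough illustration of the quick check this approach would give an SRE, the sketch below compares the major components of two version strings. The function names and the example version strings are assumptions made for illustration; they are not part of any existing registry tooling.

```python
def parse_major(version: str) -> int:
    """Return the major component of a version such as 'v3.74.0-gitlab'."""
    # Drop an optional leading 'v', then take everything before the first dot.
    return int(version.lstrip("v").split(".", 1)[0])


def same_major(deployed: str, target: str) -> bool:
    """True when both versions share a major number.

    Under the proposed convention, a differing major number would signal
    that reverting from `deployed` to `target` is not considered safe.
    """
    return parse_major(deployed) == parse_major(target)


if __name__ == "__main__":
    # Hypothetical version strings, for illustration only.
    print(same_major("v3.74.0-gitlab", "v3.73.1-gitlab"))
    print(same_major("v4.0.0-gitlab", "v3.74.0-gitlab"))
```

The check is only as reliable as the manual discipline behind the version bumps, which is the main trade-off of this option.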
Detect Rollbackable Migrations via Tooling
The Migration Tool could be enhanced to validate whether it's safe to revert to a previous registry version.
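One way such a check could work is sketched below, assuming the tool can list the migration IDs already applied to the database and the migration IDs shipped with the target version. The rule used here (any applied migration unknown to the target version blocks the rollback) is a deliberately conservative assumption, and `check_rollback` and `RollbackCheck` are hypothetical names, not existing Migration Tool features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackCheck:
    safe: bool
    unknown_migrations: list[str]


def check_rollback(applied: set[str], known_to_target: set[str]) -> RollbackCheck:
    """Compare applied migrations against those the target version knows about.

    Conservative assumption: if the database has a migration applied that the
    target version does not ship, the rollback is flagged as unsafe.
    """
    unknown = sorted(applied - known_to_target)
    return RollbackCheck(safe=not unknown, unknown_migrations=unknown)


if __name__ == "__main__":
    # Hypothetical migration IDs, for illustration only.
    applied = {"20230401_create_table", "20230510_add_column"}
    target = {"20230401_create_table"}
    print(check_rollback(applied, target))
```

A real implementation would probably need a way to mark individual migrations as backward compatible, since not every newer migration actually prevents an older binary from running.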
Solution
Improve the processes and best practices by clearly highlighting which changes can be reverted by the SRE team:
- (Documentation/Process) Introduce an MR template (similar to https://gitlab.com/gitlab-org/gitlab-pages/-/blob/master/.gitlab/merge_request_templates/Default.md) where every merge request must specify (using checkboxes and descriptions):
  - Whether it introduces changes that do not allow rolling back to a previous version of the registry due to migrations (note: all other reasons for not being able to roll back will be sufficiently captured by major version changes).
  - The reasoning behind why a release containing the MR cannot be rolled back due to migrations.
  - What steps need to be carried out when a direct revert to an earlier version should not be attempted (due to the migration changes introduced in a release version), yet a feature (or features) that accompanied the release needs to be reverted or turned off.
  - That the requestor has labelled the MR with a specific tag that indicates it is "non-revertable"/"revertable" for migration reasons (on GitLab.com gprd, pre, and gstg deployments for now).
  - We should also make sure that if we do have a non-revertable MR, it's released in a version that only contains the changes from the MR.
  - For any MR that has a migration, we should have checkboxes confirming it has been tested as safe for cny deploys and for rollbacks, with links to documentation on how to do that. A manual version of the automation description in your comment above is a great place to start with that documentation, but it doesn't have to exist right away.
- (Process) In the release plan, all merged changes need to be identified as "revertable"/"non-revertable" by being tagged as such. The same tags need to be manually propagated to all version bump MRs. (By propagating the tag, we tell the version bump's assigned merger (or "reverter") whether the version can be safely rolled back without the risk of db migration inconsistencies.)
- (Documentation/Process) Communicate this new tag and its meaning to the SRE teams (in the registry runbook).
- (Documentation/Process) Add a checkbox to make us think about whether the change is safe to roll out to cny first and only then to the main stage.
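The propagation rule for version bump MRs can be stated compactly: a bump is "non-revertable" as soon as any MR it contains is. The sketch below is only an illustration of that rule under stated assumptions. The label strings follow the wording used in this issue, while `bump_label` and the MR references are hypothetical.

```python
REVERTABLE = "revertable"
NON_REVERTABLE = "non-revertable"


def bump_label(mr_labels: dict[str, set[str]]) -> str:
    """Derive the label for a version bump from the labels of its MRs.

    `mr_labels` maps an MR reference to the set of labels on that MR.
    Raises ValueError if any MR carries neither label, because an
    unlabelled MR means the revert status of the release is unknown.
    """
    missing = sorted(
        mr for mr, labels in mr_labels.items()
        if not labels & {REVERTABLE, NON_REVERTABLE}
    )
    if missing:
        raise ValueError(f"MRs without a revert label: {', '.join(missing)}")
    if any(NON_REVERTABLE in labels for labels in mr_labels.values()):
        return NON_REVERTABLE
    return REVERTABLE


if __name__ == "__main__":
    # Hypothetical MR references, for illustration only.
    release = {"!101": {REVERTABLE}, "!102": {NON_REVERTABLE, "database"}}
    print(bump_label(release))
```

Failing loudly on unlabelled MRs is a design choice made here so that a missing label is never silently read as "revertable".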
In a future iteration we could also consider building in automation and bot reminders:
- (Automation) so that the tag indicating the MR is "non-revertable" is propagated to all version bump MRs as well as the release plan issue.
- (Bot reminder) to make sure the MR template has been filled in.
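A bot reminder could start from something as small as listing the checkboxes left unticked in an MR description. The sketch below assumes the template uses standard Markdown task-list syntax (`- [ ]` and `- [x]`). `unchecked_items` is a hypothetical helper, and fetching the description from the GitLab API is deliberately left out.

```python
import re

# Matches an unticked Markdown task-list item, e.g. "- [ ] Rollback tested".
_UNCHECKED = re.compile(r"^[ \t]*[-*][ \t]+\[ \][ \t]+(.*)$", re.MULTILINE)


def unchecked_items(description: str) -> list[str]:
    """Return the text of every unticked checkbox in an MR description."""
    return [match.strip() for match in _UNCHECKED.findall(description)]


if __name__ == "__main__":
    # Hypothetical template content, for illustration only.
    description = (
        "- [x] Safe to deploy to cny first\n"
        "- [ ] Rollback tested\n"
    )
    print(unchecked_items(description))
```

Since some checkboxes in a template may legitimately stay unticked, a real reminder would likely need a list of required items rather than flagging every open box.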