Rollback Discovery for auto-deploy (Dedicated/GET tooling)
Problem Statement
We do not currently have a rollback recovery method for non-Geo installations using the Dedicated Tooling.
Existing reference material:
There's a set of jobs in switchboard UAT that make reference to rolling back: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/sandbox/switchboard_uat/-/blob/4685a9f12aff8596b384deb05f319be500d3995f/templates/switchboard.yml#L333-427
The pipeline has a slightly different setup for these jobs:
Overview
- The tenant is failed over to its Geo counterpart
- The previous site's database is removed
- The database is recreated as empty and configured to sync with the secondary site
- The tenant is failed back over to the original Primary site
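The four steps above can be sketched as an ordered sequence that stops at the first failure. This is only an illustration of the flow, not the switchboard implementation: the step names, `RollbackStep`, `run_rollback`, and `geo_rollback_steps` are all hypothetical, and each action here just records a log line where a real job would call the Dedicated tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RollbackStep:
    name: str                    # hypothetical step identifier
    action: Callable[[], bool]   # returns True on success


@dataclass
class RollbackResult:
    completed: List[str] = field(default_factory=list)
    failed_at: Optional[str] = None


def run_rollback(steps: List[RollbackStep]) -> RollbackResult:
    """Run steps in order and stop at the first one that fails."""
    result = RollbackResult()
    for step in steps:
        if not step.action():
            result.failed_at = step.name
            break
        result.completed.append(step.name)
    return result


def geo_rollback_steps(tenant: str, log: List[str]) -> List[RollbackStep]:
    """Build the four Geo-based steps; actions are placeholders that only log."""
    def record(message: str) -> Callable[[], bool]:
        def action() -> bool:
            log.append(f"{tenant}: {message}")
            return True
        return action

    return [
        RollbackStep("failover_to_secondary", record("fail over to Geo counterpart")),
        RollbackStep("remove_old_database", record("remove previous site's database")),
        RollbackStep("recreate_and_sync", record("recreate empty database, sync from secondary")),
        RollbackStep("failback_to_primary", record("fail back to original Primary site")),
    ]
```

Stopping at the first failed step matters here because each step assumes the previous one succeeded; for example, removing the database before a failover has completed would leave the tenant without a serving site.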
Research
- This rollback method relies strictly on Geo. Geo is not currently planned for the initial iteration of Cells, so the method Dedicated is attempting to leverage is insufficient for the needs of .com.
- Tests performed by Dedicated are manual and take upwards of 30 minutes. We need this to be automated because we may need to roll back many Cells at once.
- Recovery of the primary site would benefit from not dropping a potentially giant database. Dropping it increases risk, because DR would be limited during an incident (worst-case thinking, possibly overoptimization).
- While the ultimate goal is to run stable versions at all times, at least a few Cells in the first couple of Rings of our deployment mechanism may not be on a stable version.
- We need a method of returning a Cell to a working state whenever we roll back our current Main Stage of Production, or when recovering from an incident in which a deploy fails on any Cell assigned to any Ring.
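To illustrate why automation matters when many Cells are affected, the sketch below selects every Cell running a failed version and groups them by Ring. It is a sketch under stated assumptions: the `Cell` record, its fields, the version strings, and the ascending Ring order are all hypothetical and do not come from the Cells or Dedicated tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Cell:
    name: str      # hypothetical Cell identifier
    ring: int      # hypothetical Ring number (lower = earlier in the rollout)
    version: str   # version currently deployed on the Cell


def plan_rollback(cells: Iterable[Cell], bad_version: str) -> Dict[int, List[str]]:
    """Group the Cells running bad_version by Ring, in ascending Ring order."""
    by_ring: Dict[int, List[str]] = defaultdict(list)
    for cell in cells:
        if cell.version == bad_version:
            by_ring[cell.ring].append(cell.name)
    # Sort Rings and Cell names so the plan is deterministic.
    return {ring: sorted(names) for ring, names in sorted(by_ring.items())}
```

A plan like this could feed an automated job per Cell; which Ring should be rolled back first is a policy decision that this sketch does not settle.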