Automatic rollback in case of failure
This feature is in the GitLab Ultimate tier.
Problem to solve
As a developer, I want to make sure no faulty deployment is active on my production environment at any time, so that I do not unnecessarily expose my users to a subpar experience due to external dependencies.
User experience goal
If there is a problem in the pipeline that deploys, it would be nice if the pipeline would perform an automatic rollback.
For this iteration (and this specific issue)
- Any critical alert on the environment will initiate a rollback.
- The user must opt-in to this feature (setting defined below in acceptance criteria)
- There will only be one rollback attempt on an alert (to avoid an endless loop of rollbacks)
- Add settings section for the user to configure Auto-rollback on/off to CI/CD project settings above
- Features a single checkbox with label and help text and a confirmation button.
- Rollback is to the last successful deployment (This will re-run the pipeline of the last successful deployment)
- Auto rollback must be logged in the audit log as an action done in the pipeline
- For the first iteration - any critical alert will trigger a rollback
- Confirmation is a primary success button
- Copy of settings UI:
Automatic deployment rollbacks
Automatically roll back to the last successful deployment when a critical problem is detected.
- [ ] Enable automatic rollbacks
Automatic rollbacks start when a critical alert is triggered. If the last successful deployment fails to roll back automatically, it can still be done manually. More information
|Mockup (browser made)|
|Note: Copy might differ from the mockup, see acceptance criteria above|
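The behavioral criteria above (opt-in, critical alerts only, one attempt per alert, audit logging) can be read as a small guard. The following is a minimal sketch in plain Ruby; `Project`, `Alert`, `AutoRollbackGuard`, and all of their fields are hypothetical names used for illustration and are not taken from GitLab's codebase.

```ruby
# Hypothetical stand-ins; none of these names come from GitLab's codebase.
Project = Struct.new(:auto_rollback_enabled, keyword_init: true)
Alert = Struct.new(:id, :severity, keyword_init: true)

class AutoRollbackGuard
  attr_reader :audit_log

  def initialize(project)
    @project = project
    @handled_alert_ids = {} # alerts that already had their one attempt
    @audit_log = []         # in-memory placeholder for the real audit log
  end

  # Returns true when a rollback should start for this alert.
  def attempt?(alert)
    return false unless @project.auto_rollback_enabled # user must opt in
    return false unless alert.severity == :critical    # critical alerts only
    return false if @handled_alert_ids.key?(alert.id)  # one attempt per alert

    @handled_alert_ids[alert.id] = true
    @audit_log << "auto rollback started for alert #{alert.id}"
    true
  end
end
```

Recording the alert ID before the rollback starts is one way to make sure a failed rollback is not retried for the same alert.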
Out of scope for this issue:
- Metrics will be defined in a dedicated YML file.
- Only metrics defined in this YML file will initiate the auto-rollback similar to the common_metrics.yml. ./gitlab/dashbords.yml
Weight estimate: 3
backend: Create a worker to trigger an automatic rollback, should a deployment fail.
frontend: Create a switch in the project settings. Users should be able to choose whether rollbacks should be automatic or manual (as highlighted in link posted above)
How GitLab Re-deploy feature behaves today
- GitLab already knows the list of successful deployments on an environment.
- GitLab can deduce the latest successful deployment from the deployment list. (Let's say Deployment-A)
- GitLab can deduce the previous successful deployment from the deployment list and Deployment-A. (Let's say Deployment-B)
- When Deployment-B is re-deployed, GitLab creates Deployment-C. Deployment-B and C have the same metadata.
- We can check if a deployment was re-deployed:
project.deployments.where(ref: deployment.ref, sha: deployment.sha).exists?
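The deduction steps above can be sketched in plain Ruby, without Rails. The `Deployment` struct, its fields, and the helper names are assumptions for illustration, and the sketch assumes a higher `id` means a newer deployment. Unlike the one-line query, `redeployed?` excludes the deployment itself, on the assumption that a deployment should not count as its own re-deploy.

```ruby
# Hypothetical stand-in for a deployment record; not GitLab's actual model.
Deployment = Struct.new(:id, :ref, :sha, :status, keyword_init: true)

# Successful deployments on an environment, newest first
# (assumes a higher id means a newer deployment).
def successful_deployments(deployments)
  deployments.select { |d| d.status == :success }.sort_by { |d| -d.id }
end

# Deployment-A in the text: the latest successful deployment.
def latest_successful(deployments)
  successful_deployments(deployments).first
end

# Deployment-B in the text: the successful deployment just before `deployment`.
def previous_successful(deployments, deployment)
  successful_deployments(deployments).find { |d| d.id < deployment.id }
end

# True when another deployment shares the same ref and sha, which means the
# deployment was re-deployed (or is itself the result of a re-deploy).
def redeployed?(deployments, deployment)
  deployments.any? do |d|
    d.id != deployment.id && d.ref == deployment.ref && d.sha == deployment.sha
  end
end
```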
How we will extend
- Auto Rollback happens only once when a new critical alert is raised.
- We cannot simply re-deploy Deployment-B, because Deployment-A then becomes the previous successful deployment and therefore the next rollback target. The feature would keep deploying Deployment-A and Deployment-B alternately, which is probably not what we want. To illustrate:
- Deployment-C (latest, same content as Deployment-B)
- Deployment-A (previous successful)
- Deployment-B (previous previous successful)
- We need to persist the rollback history as the following.
deployments.auto_redeployed_by_id(FK) ... The ID of the deployment that initiated the auto re-deploy.
- Let's say there are two deployments Deployment-A and Deployment-B and a critical alert is raised on the environment.
- What is the latest deployment? => Deployment-A
- What is the previous successful deployment? => Deployment-B
- Should GitLab re-deploy Deployment-B? => Yes
- Let's say a new critical alert is raised on the environment again.
- What is the latest deployment? => Deployment-C
- What is the previous successful deployment? => Deployment-A
- Should GitLab re-deploy Deployment-A? => No, because Deployment-C was triggered by A. Next.
- Should GitLab re-deploy Deployment-B? => No, because Deployment-C is identical to Deployment-B. Next.
- If there is a running deployment on the environment when a critical alert is raised, this feature won't do anything. (Please see the "Constant Rollback" below)
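The decision walk above can be sketched as a target-selection function. This is a minimal sketch in plain Ruby, not the actual implementation: only `auto_redeployed_by_id` comes from the proposal, while the `Deployment` struct, the other fields, and the method names are hypothetical. It assumes a higher `id` means a newer deployment.

```ruby
# Hypothetical stand-in for a deployment row; only auto_redeployed_by_id comes
# from the proposal, the other fields are assumptions for illustration.
Deployment = Struct.new(:id, :ref, :sha, :status, :auto_redeployed_by_id,
                        keyword_init: true)

# Returns the deployment to re-deploy when a critical alert is raised, or nil.
def auto_rollback_target(deployments)
  # Do nothing while a deployment is still running on the environment.
  return nil if deployments.any? { |d| d.status == :running }

  successful = deployments.select { |d| d.status == :success }.sort_by { |d| -d.id }
  latest, *older = successful
  return nil if latest.nil?

  older.find do |candidate|
    # Skip the deployment that initiated the latest auto re-deploy.
    next false if latest.auto_redeployed_by_id == candidate.id
    # Skip a deployment with the same content as the latest one.
    next false if candidate.ref == latest.ref && candidate.sha == latest.sha

    true
  end
end

# Builds the deployment row created by re-deploying `target` on top of `latest`.
def auto_redeploy(latest, target)
  Deployment.new(id: latest.id + 1, ref: target.ref, sha: target.sha,
                 status: :success, auto_redeployed_by_id: latest.id)
end

# Walk-through of the two alerts described above.
b = Deployment.new(id: 1, ref: "main", sha: "bbb", status: :success)
a = Deployment.new(id: 2, ref: "main", sha: "aaa", status: :success)

first_target = auto_rollback_target([b, a])     # Deployment-B
c = auto_redeploy(a, first_target)              # Deployment-C, initiated by A
second_target = auto_rollback_target([b, a, c]) # nil, so no second rollback
```

With an older successful deployment present in the list, the second alert would fall through to it, which matches the "Next." steps in the walk above.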
Which alert is considered as critical?
Alerts have a severity column that takes one of:
critical: 0, high: 1, medium: 2, low: 3, info: 4, unknown: 5.
critical is the severity of an alert that triggers a rollback.
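As a rough illustration, the severity mapping and the trigger rule could look like this in plain Ruby. The values are the ones listed above; the constant and method names are hypothetical.

```ruby
# Severity values as listed above; the constant and method names are hypothetical.
ALERT_SEVERITIES = {
  critical: 0, high: 1, medium: 2, low: 3, info: 4, unknown: 5
}.freeze

# Only a critical alert triggers an automatic rollback in this iteration.
# Raises KeyError for a severity name that is not in the mapping.
def triggers_auto_rollback?(severity)
  ALERT_SEVERITIES.fetch(severity) == ALERT_SEVERITIES[:critical]
end
```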
Rollback range (Next iteration)
- Rollback is a useful way to revert problematic code, but it also risks removing valid code or features, which disrupts end users' requests.
- Operators should be able to set a rollback range to group backward-compatible deployments. Auto Rollback should happen within the current range and shouldn't cross it.
Constant Rollback (Next iteration)
- If a critical alert still exists X minutes after the Auto Rollback point, the next Auto Rollback will be triggered.
- Last version to rollback to (including gitlab-ci.yml file)
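The timing rule for Constant Rollback could be checked along these lines. This is only a sketch: the method and parameter names are hypothetical, and X is passed in as `grace_period_minutes` because the proposal leaves its value open.

```ruby
# Hypothetical check for the Constant Rollback idea. X from the text is passed
# in as grace_period_minutes because the proposal does not fix its value.
def next_auto_rollback_due?(last_rollback_at:, alert_still_firing:,
                            grace_period_minutes:, now: Time.now)
  return false unless alert_still_firing

  # Subtracting two Time objects yields the elapsed seconds.
  now - last_rollback_at >= grace_period_minutes * 60
end
```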
Spinnaker supports a similar feature that is specifically designed for rollback on Kubernetes's Rolling Update: