Automatic rollback in case of failure

Release notes

Problem to solve

As a developer, I want to make sure that no faulty deployment is ever active on my production environment, so that I do not expose my users unnecessarily to a subpar experience caused by external dependencies.

Intended users

User experience goal

Proposal

If there is a problem in the deployment pipeline, the pipeline should perform an automatic rollback.

For this iteration (and this specific issue):

- Any critical alert on the environment will initiate a rollback.
- The user must opt in to this feature (setting defined below in acceptance criteria).
- There will be only one rollback attempt per alert (to avoid an endless loop of rollbacks).

Acceptance criteria

- Add a settings section to the CI/CD project settings, above Deploy freezes, where the user can turn Auto-rollback on or off.
  - The section features a single checkbox with a label and help text, and a confirmation button.
- Rollback is to the last successful deployment. (This re-runs the pipeline of the last successful deployment.)
- Auto rollback must be logged in the audit log as an action done in the pipeline.
- For the first iteration, any critical alert triggers a rollback.
- The confirmation button is a primary success button.
- Copy of the settings UI:

Automatic deployment rollbacks

Automatically roll back to the last successful deployment when a critical problem is detected.

- [ ] Enable automatic rollbacks
      Automatic rollbacks start when a critical alert is triggered. If the last successful deployment fails to roll 
      back automatically, it can still be done manually. More information

Mockup (browser made)

Note: Copy might differ from the mockup, see acceptance criteria above

Out of scope for this issue:

Metrics will be defined in a dedicated YML file.
- Only metrics defined in this YML file will initiate the auto-rollback, similar to the common_metrics.yml. ./gitlab/dashbords.yml

Engineering scope

Weight estimate: 3

backend: Create a worker to trigger an automatic rollback, should a deployment fail.
frontend: Create a switch in the project settings. Users should be able to choose whether rollbacks should be automatic or manual (as highlighted in link posted above)
Documentation guidelines

Technical proposal

How the GitLab Re-deploy feature behaves today

GitLab already knows the list of successful deployments on an environment.
GitLab can deduce the latest successful deployment from the deployment list. (Let's say Deployment-A)
GitLab can deduce the previous successful deployment from the deployment list and Deployment-A. (Let's say Deployment-B)
When Deployment-B is re-deployed, GitLab creates Deployment-C. Deployment-B and C have the same metadata.
We can check if a deployment was re-deployed: project.deployments.where(ref: deployment.ref, sha: deployment.sha).exists?
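The re-deploy check above can be sketched in plain Ruby, without a database. This is only an illustration: the `Deployment` struct and the `redeployed?` helper are hypothetical names, not GitLab's models, and excluding the record itself from the comparison is an assumption of this sketch.

```ruby
# Hypothetical stand-in for a deployment record (not GitLab's model).
Deployment = Struct.new(:id, :ref, :sha, keyword_init: true)

# True when another deployment in the list has the same ref and sha.
# The record itself is excluded so it does not match its own metadata
# (an assumption of this sketch).
def redeployed?(deployments, deployment)
  deployments.any? do |other|
    other.id != deployment.id &&
      other.ref == deployment.ref &&
      other.sha == deployment.sha
  end
end

deployment_b = Deployment.new(id: 1, ref: "main", sha: "bbb222")
deployment_a = Deployment.new(id: 2, ref: "main", sha: "aaa111")
# Deployment-C is a re-deploy of Deployment-B, so it carries the same metadata.
deployment_c = Deployment.new(id: 3, ref: "main", sha: "bbb222")

history = [deployment_b, deployment_a, deployment_c]
p redeployed?(history, deployment_b) # true
p redeployed?(history, deployment_a) # false
```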

How we will extend

Auto Rollback happens only once when a new critical alert is raised.
We cannot simply re-deploy Deployment-B, because Deployment-A would then become the previous successful deployment and therefore the next rollback target. The feature would keep deploying Deployment-A and Deployment-B alternately, which is probably not what we want. To illustrate:
- Deployment-C (latest, same content as Deployment-B)
- Deployment-A (previous successful)
- Deployment-B (previous previous successful)
We need to persist the rollback history as follows:
- deployments.auto_redeployed_by_id (FK) ... The ID of the deployment that initiated the auto re-deploy.
Let's say there are two deployments, Deployment-A and Deployment-B, and a critical alert is raised on the environment.
- What is the latest deployment? => Deployment-A
- What is the previous successful deployment? => Deployment-B
- Should GitLab re-deploy Deployment-B? => Yes
Let's say a new critical alert is raised on the environment again.
- What is the latest deployment? => Deployment-C
- What is the previous successful deployment? => Deployment-A
- Should GitLab re-deploy Deployment-A? => No, because Deployment-C was triggered by Deployment-A. Next.
- Should GitLab re-deploy Deployment-B? => No, because Deployment-C is identical to Deployment-B. Next.
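The walkthrough above could be sketched as follows. This is one possible reading of the two skip rules, not a definitive implementation: the `Deployment` struct and the `auto_rollback_target` and `auto_redeploy` helpers are hypothetical, and only `auto_redeployed_by_id` comes from the proposal.

```ruby
# Hypothetical stand-in for a deployment record (not GitLab's model).
Deployment = Struct.new(:id, :ref, :sha, :status, :auto_redeployed_by_id,
                        keyword_init: true)

# deployments is ordered oldest to newest. Returns the deployment to
# re-deploy, or nil when no candidate passes the two skip rules.
def auto_rollback_target(deployments)
  latest = deployments.last
  return nil if latest.nil?

  # Walk backwards from the deployment just before the latest one.
  deployments[0...-1].reverse.find do |candidate|
    next false unless candidate.status == :success
    # Skip the deployment that triggered the latest auto re-deploy.
    next false if latest.auto_redeployed_by_id == candidate.id
    # Skip a deployment that is identical to the latest one.
    next false if candidate.ref == latest.ref && candidate.sha == latest.sha

    true
  end
end

# Builds the re-deploy record: same metadata as the target, and a pointer
# to the deployment that was live when the rollback started.
def auto_redeploy(deployments, target)
  latest = deployments.last
  Deployment.new(id: latest.id + 1, ref: target.ref, sha: target.sha,
                 status: :success, auto_redeployed_by_id: latest.id)
end

deployment_b = Deployment.new(id: 1, ref: "main", sha: "bbb222", status: :success)
deployment_a = Deployment.new(id: 2, ref: "main", sha: "aaa111", status: :success)
history = [deployment_b, deployment_a]

# First critical alert: Deployment-A is latest, Deployment-B is the target.
first_target = auto_rollback_target(history)
history << auto_redeploy(history, first_target) # this is Deployment-C

# Second critical alert: both Deployment-A and Deployment-B are skipped.
second_target = auto_rollback_target(history)
p first_target.id      # 1
p second_target.nil?   # true
```

Persisting the initiating deployment's ID on the re-deploy record is what lets the second pass rule out Deployment-A without any extra bookkeeping.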

Anti-race condition

If there is a running deployment on the environment when a critical alert is raised, this feature won't do anything. (Please see the "Constant Rollback" below)
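A minimal sketch of this guard, assuming a hypothetical `Deployment` struct with a `status` field and a hypothetical `skip_auto_rollback?` helper. It only shows the check itself; how a real worker would make the check and the rollback atomic is not covered here.

```ruby
# Hypothetical stand-in for a deployment record (not GitLab's model).
Deployment = Struct.new(:id, :status, keyword_init: true)

# True when any deployment on the environment is still running, in which
# case the auto rollback does nothing.
def skip_auto_rollback?(deployments)
  deployments.any? { |deployment| deployment.status == :running }
end

finished = [Deployment.new(id: 1, status: :success),
            Deployment.new(id: 2, status: :success)]
in_flight = finished + [Deployment.new(id: 3, status: :running)]

p skip_auto_rollback?(finished)  # false
p skip_auto_rollback?(in_flight) # true
```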

Which alert is considered as critical?

environment.alert_management_alerts has a severity column that takes the values critical: 0, high: 1, medium: 2, low: 3, info: 4, unknown: 5. Only an alert with critical severity triggers a rollback.
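As a rough illustration, the severity mapping above can be mirrored in a plain Ruby hash. The `Alert` struct and the `triggers_rollback?` helper are hypothetical names used only for this sketch. The integer values are the ones listed above.

```ruby
# Severity values as listed for the severity column.
SEVERITIES = {
  critical: 0, high: 1, medium: 2, low: 3, info: 4, unknown: 5
}.freeze

# Hypothetical stand-in for an alert row; severity is the stored integer.
Alert = Struct.new(:severity, keyword_init: true)

# Only a critical alert triggers a rollback in the first iteration.
def triggers_rollback?(alert)
  alert.severity == SEVERITIES.fetch(:critical)
end

p triggers_rollback?(Alert.new(severity: SEVERITIES[:critical])) # true
p triggers_rollback?(Alert.new(severity: SEVERITIES[:high]))     # false
```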

Rollback range (Next iteration)

Rollback is a useful operation for reverting problematic code, but it also risks removing valid code or features, which can disrupt end users' requests.
Operators should be able to set a rollback range to group backward-compatible deployments. Auto Rollback should happen only within the current range and should not cross its boundary.

Constant Rollback (Next iteration)

If a critical alert still exists X minutes after the Auto Rollback point, the next Auto Rollback will be triggered.
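The timing rule could be expressed roughly as below. The value of X is not defined in this issue, so it is passed in as a hypothetical `wait_minutes` parameter. The helper name and its keyword arguments are also assumptions of this sketch.

```ruby
# True when a critical alert is still open and at least wait_minutes have
# passed since the previous Auto Rollback point.
# wait_minutes stands for the unspecified "X minutes".
def next_auto_rollback_due?(rolled_back_at:, now:, alert_still_critical:, wait_minutes:)
  return false unless alert_still_critical

  (now - rolled_back_at) >= wait_minutes * 60
end

rolled_back_at = Time.at(0)

p next_auto_rollback_due?(rolled_back_at: rolled_back_at,
                          now: rolled_back_at + 600,
                          alert_still_critical: true,
                          wait_minutes: 10) # true
p next_auto_rollback_due?(rolled_back_at: rolled_back_at,
                          now: rolled_back_at + 300,
                          alert_still_critical: true,
                          wait_minutes: 10) # false
```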

Future iterations

Last version to rollback to (including gitlab-ci.yml file)

Further details

Spinnaker supports a similar feature that is specifically designed for rolling back Kubernetes rolling updates:

Description	Screenshot
A user can configure which version to roll back to (for example, the last version deployed to production)
We should note that for a proper rollback, we need to roll back	Kubernetes Config, Docker images, gitlab-ci.yml
View of the Kubernetes Clusters

Permissions and Security

Documentation

Availability & Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Is this a cross-stage feature?

Links / references