Availability Improvement: Automated or EOC-initiated rollback of last deployment based on time correlation
Summary
Both 2023-07-14: Intermittent apdex dips for web and... (production#16042 - closed) and https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24115+ could have benefited from an automated workflow that halts or rolls back deployments based on time correlation. Our bots auto-started an incident while the deployment was in progress: production#16041 (closed).
Can EOCs initiate a deployment rollback to shorten recovery time? If not, why not?
Related Incident(s)
Originating issue(s): 2023-07-14: Intermittent apdex dips for web and... (production#16042 - closed)
Also https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24115+
Desired Outcome/Acceptance Criteria
Multiple ideas:
- When an auto-generated incident is opened during an ongoing deploy, the deploy should be halted automatically.
- When an auto-generated incident is opened immediately following a deploy, the deploy should be rolled back automatically.
- When the root cause of an incident is determined to be a software change in the most recent deploy, an EOC should be able to safely roll it back.
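The first two ideas amount to a time-correlation rule, which can be sketched roughly as below. This is a minimal illustration, not an existing tool: the `Deploy` record, the `correlate` function, and the 30-minute `ROLLBACK_WINDOW` are all hypothetical, and the issue does not say how long "immediately following" should be.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    HALT = "halt"
    ROLLBACK = "rollback"


@dataclass
class Deploy:
    """Hypothetical record of the most recent deployment."""
    started_at: datetime
    finished_at: Optional[datetime] = None  # None while the deploy is still running


# Assumed value: the issue does not define "immediately following".
ROLLBACK_WINDOW = timedelta(minutes=30)


def correlate(incident_opened_at: datetime, deploy: Deploy,
              window: timedelta = ROLLBACK_WINDOW) -> Action:
    """Pick an action for an auto-generated incident based on deploy timing."""
    if incident_opened_at < deploy.started_at:
        # The incident predates the deploy, so the deploy cannot be the trigger.
        return Action.NONE
    if deploy.finished_at is None or incident_opened_at <= deploy.finished_at:
        # Incident opened while the deploy was in progress: halt it.
        return Action.HALT
    if incident_opened_at - deploy.finished_at <= window:
        # Incident opened shortly after the deploy finished: roll it back.
        return Action.ROLLBACK
    return Action.NONE
```

The third idea, an EOC-initiated rollback, stays a human decision and is not covered by this sketch. Time correlation alone does not establish root cause, so an automated rollback would still need the safety checks this issue asks about.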
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose from
- Give context for what problem this corrective action is trying to prevent re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
Edited by Michael Kozono