Availability Improvement: Automated or EOC-initiated rollback of the last deployment based on time correlation

Summary

Both 2023-07-14: Intermittent apdex dips for web and... (production#16042 - closed) and https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24115+ could have benefited from an automated workflow that halts or rolls back a deployment based on time correlation. Our bots auto-started an incident while the deployment was still in progress: production#16041 (closed).

Can EOCs initiate a deployment rollback to shorten recovery time? If not, why not?

Related Incident(s)

Originating issue(s): 2023-07-14: Intermittent apdex dips for web and... (production#16042 - closed)

Also https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24115+

Desired Outcome/Acceptance Criteria

Multiple ideas:

  • When an auto-generated incident is opened during an ongoing deploy, the deploy should be halted automatically.
  • When an auto-generated incident is opened immediately following a deploy, the deploy should be rolled back automatically.
  • When the root cause of an incident is determined to be a software change in the most recent deploy, an EOC should be able to safely roll it back.
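
The first two ideas amount to a time-correlation rule, which could be sketched roughly as follows. This is only an illustration, not an existing tool: the `Deploy` record, the `decide_action` function, and the 30-minute `ROLLBACK_WINDOW` are all hypothetical, and the issue does not define how long "immediately following" is. The function is assumed to run at the moment an incident is opened. The third idea (EOC-initiated rollback after root cause analysis) is a manual decision and is not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    HALT = "halt"
    ROLLBACK = "rollback"


@dataclass
class Deploy:
    """Hypothetical record of the most recent deployment."""
    started_at: datetime
    finished_at: Optional[datetime] = None  # None while the deploy is still running


# Assumed threshold: the issue does not say what "immediately following" means.
ROLLBACK_WINDOW = timedelta(minutes=30)


def decide_action(incident_opened_at: datetime,
                  auto_generated: bool,
                  deploy: Deploy,
                  window: timedelta = ROLLBACK_WINDOW) -> Action:
    """Pick an action for the last deploy when an incident is opened."""
    # Only auto-generated incidents trigger automation; incidents that
    # predate the deploy cannot have been caused by it.
    if not auto_generated or incident_opened_at < deploy.started_at:
        return Action.NONE
    # Incident opened while the deploy is still in progress: halt it.
    if deploy.finished_at is None:
        return Action.HALT
    # Incident opened shortly after the deploy finished: roll it back.
    elapsed = incident_opened_at - deploy.finished_at
    if timedelta(0) <= elapsed <= window:
        return Action.ROLLBACK
    return Action.NONE
```

In this sketch, an auto-generated incident opened five minutes into a running deploy yields `Action.HALT`, while one opened ten minutes after the deploy finished yields `Action.ROLLBACK`. The window is a tunable parameter; choosing it is a trade-off between catching slow-burning regressions and rolling back deploys that are unrelated to the incident.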

Associated Services

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose from
  • Give context for the problem this corrective action is trying to prevent from recurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
Edited by Michael Kozono