CA: Prevent reintroducing regressions through rollbacks
Summary
While resolving production#19377 (closed), we decided to roll back to an older release to expedite resolution. Unfortunately, we missed that the rollback also undid a fix that had been rolled out to resolve an unrelated incident.
We need to give IMs and EOCs better tools to make confident decisions during incident management so that these decisions do not lead to regressions.
Related Incident(s)
Originating issue(s): production#19377 (closed)
Reintroduced incident: production#19382 (closed)
Desired Outcome/Acceptance Criteria
EOCs and IMs can make informed decisions about whether a rollback is possible or a roll-forward is necessary so as not to interfere with other incident resolutions.
Currently, when a rollback check is performed, we only check whether a rollback is possible.
In addition, we want to analyse the commits being rolled back in order to:
- Identify any security fixes or S1/S2 incident resolutions in the commits.
- Include warnings in the Slack notification if security or incident fixes would be rolled back.
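The two checks above could be sketched roughly as follows. This is a minimal illustration, not the existing rollback check: the `Commit` type, the `rollback_risks` and `build_slack_warning` helpers, the `security` and `severity::1`/`severity::2` label names, and the convention of mentioning incidents as `production#<id>` in commit titles are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

# Assumed conventions (not confirmed by this issue): commits carry the labels
# of their merge request, such as "security" or "severity::1", and incident
# fixes mention the incident as "production#<id>" in the commit title.
HIGH_SEVERITY_LABELS = {"severity::1", "severity::2"}
INCIDENT_REF = re.compile(r"production#\d+")


@dataclass
class Commit:
    """Hypothetical view of one commit that a rollback would revert."""
    sha: str
    title: str
    labels: set = field(default_factory=set)


def rollback_risks(commit):
    """Return the reasons why reverting this commit deserves a warning."""
    reasons = []
    if "security" in commit.labels:
        reasons.append("security fix")
    if commit.labels & HIGH_SEVERITY_LABELS:
        reasons.append("S1/S2 incident resolution")
    refs = INCIDENT_REF.findall(commit.title)
    if refs:
        reasons.append("references " + ", ".join(refs))
    return reasons


def build_slack_warning(commits):
    """Build warning text for the rollback notification, or None if clean."""
    lines = []
    for commit in commits:
        reasons = rollback_risks(commit)
        if reasons:
            lines.append(f"- {commit.sha[:8]} {commit.title} ({'; '.join(reasons)})")
    if not lines:
        return None
    return "Warning: this rollback would revert:\n" + "\n".join(lines)
```

With this shape, the rollback check would still report whether a rollback is possible, and the warning text would only be appended to the Slack notification when `build_slack_warning` returns something other than `None`. How commits are actually mapped to labels and incidents is left open here.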
Associated Services
ServiceWeb in Production Engineering in this case, but this applies to the general incident management procedure.
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose from
- Give context for what problem this corrective action is trying to prevent re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Production Engineering::P4' but should match the severity of the related incident)
- Assign a service label
- Assign a team label
