Introduce Root Cause Analysis process for unplanned upgrade stops
Goal
Create a Root Cause Analysis (RCA) process to follow whenever a new unplanned upgrade stop is discovered, to ensure that all corrective actions are completed and test coverage is improved.
Background
Each unplanned upgrade stop should be handled as an incident because it is very disruptive for customers. Any resulting product or testing corrective actions should be treated as S1.
Example RCA: #423895
Proposed flow
- Support discovers an unexpected upgrade path error
- Raise an RCA issue to identify why the error happens and how many customers are affected, and update the upgrade path to include the unplanned stop to resolve the immediate incident
- Raise follow-up issues for: a) Engineering team: is there room for product improvement to catch this earlier? b) Test Platform team: analyze the test gap to determine why this wasn't caught by existing testing.
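To illustrate what "update upgrade path to include unplanned stop" means in practice, here is a minimal sketch, assuming versions can be simplified to (major, minor) tuples and that an upgrade path is just the ordered list of required stops between the current and target versions. The function name `upgrade_path` and all version numbers are hypothetical and are not taken from any real tooling.

```python
from typing import List, Tuple

# Hypothetical simplification: a version is a (major, minor) tuple.
Version = Tuple[int, int]


def upgrade_path(current: Version, target: Version, stops: List[Version]) -> List[Version]:
    """Return the required stops strictly between current and target, then the target."""
    if current >= target:
        # Nothing to upgrade.
        return []
    # Keep only stops that lie strictly between the two versions, in ascending order.
    between = sorted(s for s in set(stops) if current < s < target)
    return between + [target]


# Hypothetical planned stops, for illustration only.
planned_stops = [(1, 4), (1, 8), (2, 0)]
print(upgrade_path((1, 2), (2, 3), planned_stops))

# After an RCA confirms an unplanned stop at the hypothetical version (1, 6),
# adding it to the list changes the path that customers are told to follow.
print(upgrade_path((1, 2), (2, 3), planned_stops + [(1, 6)]))
```

The point of the sketch is that the immediate incident is resolved by a data change (one more required stop), while the follow-up issues address why the stop was needed and why testing did not catch it.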
High-level work overview
- Create a detailed RCA template for unplanned upgrade stops
- Collaborate with the Engineering and Support teams for feedback
- Document RCA process
- Create a dashboard for tracking the trend of upgrade stop RCA issues over time
- Additional: raise a follow-up for an RCA process for migration errors that do not cause an unplanned stop (example: https://gitlab.com/gitlab-org/gitlab/-/issues/449650)
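As one possible starting point for the dashboard item above, the trend could be derived by counting RCA issues per month of creation. This is a minimal sketch, assuming the creation dates of the RCA issues have already been collected by some other means; the function name `rca_issues_per_month` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def rca_issues_per_month(created_dates: Iterable[date]) -> Dict[str, int]:
    """Count RCA issues per calendar month, keyed as 'YYYY-MM' in ascending order."""
    counts = Counter(d.strftime("%Y-%m") for d in created_dates)
    return dict(sorted(counts.items()))


# Hypothetical creation dates, for illustration only.
sample = [date(2024, 1, 5), date(2024, 1, 20), date(2024, 3, 2)]
for month, count in rca_issues_per_month(sample).items():
    print(month, count)
```

Months with no RCA issues are simply absent from the result; a real dashboard would probably fill those gaps with zeros so the trend line is continuous.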
Edited by Nailia Iskhakova