Introduce Root Cause Analysis process for unplanned upgrade stops
Goal
Create a Root Cause Analysis (RCA) process to follow whenever a new unplanned upgrade stop is discovered, to ensure that all corrective actions are completed and that test coverage improves.
Background
Each unplanned upgrade stop should be handled as an incident because it is very disruptive for customers. Any product or testing corrective actions should be treated as S1.
Example RCA: #423895
Proposed flow
- Support discovers an unexpected upgrade path error.
- An RCA issue is raised to identify why the error happened and how many customers are affected, and to update the upgrade path to include the unplanned stop, which resolves the immediate incident.
- Follow-up issues are raised for: a) Engineering team - is there room for product improvement to catch this earlier? b) Test Platform team - analyze the test gap to understand why existing testing did not catch this.
High-level work overview
- Create a detailed template for unplanned upgrade stop RCAs
- Collaborate with the Engineering and Support teams to gather feedback
- Document the RCA process
- Create a dashboard for tracking the trend of upgrade stop RCA issues over time
- Additional: raise a follow-up for an RCA process covering migration errors that do not cause an unplanned stop. Example: https://gitlab.com/gitlab-org/gitlab/-/issues/449650
Results
After consistent collaboration with the Engineering and Support teams, both the RCA handbook page and the RCA template are available at:
- https://handbook.gitlab.com/handbook/engineering/unplanned-upgrade-stop/
- https://gitlab.com/gitlab-org/gitlab/-/blob/master/.gitlab/issue_templates/rca_upgrade_stop.md?plain=1
For the dashboard tracking upgrade stop RCA issues, https://gitlab.com/groups/gitlab-org/-/boards/7607968 was created to list currently open RCAs grouped by label. For a more complex Tableau dashboard, "Create dashboard for tracking upgrade stop RCA ..." (gitlab-org/quality/quality-engineering/team-tasks#2799) has been raised, to be revisited once the process has been in use and more data is available.
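Until a fuller dashboard exists, the trend of RCA issues over time could be approximated with a small script. The following is a minimal sketch, not part of the documented process: it assumes issue records have already been exported as dictionaries with `created_at` (ISO 8601 timestamp) and `labels` fields, and the label name `upgrade-stop-rca` and the sample records are hypothetical placeholders.

```python
from collections import Counter
from datetime import datetime


def rca_issues_per_month(issues, label):
    """Count issues carrying `label`, grouped by creation month (YYYY-MM)."""
    counts = Counter()
    for issue in issues:
        if label not in issue["labels"]:
            continue
        # Normalize a trailing "Z" so datetime.fromisoformat accepts it
        # on older Python 3 versions as well.
        created = datetime.fromisoformat(issue["created_at"].replace("Z", "+00:00"))
        counts[created.strftime("%Y-%m")] += 1
    # Sort by month so the result reads as a timeline.
    return dict(sorted(counts.items()))


# Hypothetical sample data; real records would come from an issue export.
sample_issues = [
    {"created_at": "2024-01-15T10:00:00Z", "labels": ["upgrade-stop-rca"]},
    {"created_at": "2024-01-28T08:30:00Z", "labels": ["upgrade-stop-rca", "bug"]},
    {"created_at": "2024-03-02T14:45:00Z", "labels": ["upgrade-stop-rca"]},
    {"created_at": "2024-03-10T09:00:00Z", "labels": ["bug"]},
]

trend = rca_issues_per_month(sample_issues, "upgrade-stop-rca")
print(trend)
```

Grouping by month keeps the output small enough to read at a glance; a finer or coarser bucket only requires changing the `strftime` format string.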