Proposal: Document "root cause analysis" in master-broken incidents
## Context

A master-broken incident almost always implies that an MR pipeline was green when it should have caught the failure. Outside of flaky tests, which are actively being worked on, it would be great to understand systematically why an MR pipeline was actually green. In other words, it would be great to understand (and document) the root cause of master-broken incidents.
Having this data would then allow us to:
- Understand which root causes are most common, and work on preventing those first for the biggest impact.
- Explain to people outside the team what the main causes of master-broken incidents are.
- Build a dashboard of those master-broken incidents (after the second iteration is in place).
## Proposal

### First iteration
- Add a **Root Cause Analysis** discussion to each master-broken incident in https://gitlab.com/gitlab-org/quality/engineering-productivity/master-broken-incidents/-/incidents (toy example here). This would allow us to document our findings without having to create a separate issue (example here). See the API sketch after this list.
- Those discussions thankfully sometimes happen organically in the incident (see master-broken-incidents#45 (comment 1159064802) for an excellent example). This proposal is to make them systematic.
- Add instructions at the end of the master-broken workflow to document the root cause of the incident.
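To make this step as cheap as possible, the workflow instructions could even link to a small script that opens the discussion thread for you. Below is a minimal sketch using the GitLab REST API's `POST /projects/:id/issues/:issue_iid/discussions` endpoint; the `GITLAB_TOKEN` environment variable, the `start_rca_discussion` helper, and the template wording are assumptions for illustration, not part of this proposal:

```python
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
# URL-encoded path of the master-broken-incidents project.
PROJECT = "gitlab-org%2Fquality%2Fengineering-productivity%2Fmaster-broken-incidents"
TOKEN = os.environ["GITLAB_TOKEN"]  # assumed: a token with `api` scope

# Hypothetical template; the real wording would live in the workflow docs.
RCA_TEMPLATE = """## Root Cause Analysis

- What failed:
- Why the MR pipeline was green:
- Prevention ideas:
"""


def start_rca_discussion(incident_iid: int) -> None:
    """Open a dedicated Root Cause Analysis thread on an incident."""
    response = requests.post(
        f"{GITLAB_API}/projects/{PROJECT}/issues/{incident_iid}/discussions",
        headers={"PRIVATE-TOKEN": TOKEN},
        data={"body": RCA_TEMPLATE},
        timeout=30,
    )
    response.raise_for_status()
```

This works because GitLab incidents are issues under the hood, so the regular issue discussions endpoint applies to them.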
### Second iteration
- Come up with a list of root causes (like we did in gitlab-org/gitlab!101543 (merged)), and make labels out of them (e.g. `master_broken::rca::infrastructure` or `master_broken::rca::static_analysis`).
- Add instructions at the end of the master-broken workflow to add an `rca` label where applicable, as sketched after this list.
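Applying the label can be done by hand or scripted. A minimal sketch using the issue update endpoint's `add_labels` parameter (the `add_rca_label` helper, the incident iid, and `GITLAB_TOKEN` are hypothetical):

```python
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-org%2Fquality%2Fengineering-productivity%2Fmaster-broken-incidents"
TOKEN = os.environ["GITLAB_TOKEN"]  # assumed: a token with `api` scope


def add_rca_label(incident_iid: int, cause: str) -> None:
    """Tag an incident with its root cause, e.g. cause='infrastructure'."""
    response = requests.put(
        f"{GITLAB_API}/projects/{PROJECT}/issues/{incident_iid}",
        headers={"PRIVATE-TOKEN": TOKEN},
        # `add_labels` appends without clobbering the incident's other labels.
        data={"add_labels": f"master_broken::rca::{cause}"},
        timeout=30,
    )
    response.raise_for_status()


add_rca_label(1234, "infrastructure")  # hypothetical incident iid
```

In practice, the same thing is a one-liner in the incident itself with a `/label` quick action in a comment.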
### Third iteration
- Make charts/tables out of those labels in Sisense, so that we can identify trends, justify why master was broken, and so on.
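Sisense would read from the data warehouse, but while the dashboard is being built, the same numbers can be previewed straight from the API. A rough sketch that tallies incidents per `master_broken::rca::*` label (pagination is needed because the API caps pages at 100 items; `GITLAB_TOKEN` is an assumption):

```python
import os
from collections import Counter

import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-org%2Fquality%2Fengineering-productivity%2Fmaster-broken-incidents"
TOKEN = os.environ["GITLAB_TOKEN"]  # assumed: a token with read access


def rca_label_counts() -> Counter:
    """Tally incidents per master_broken::rca::* label."""
    counts: Counter = Counter()
    page = 1
    while True:
        response = requests.get(
            f"{GITLAB_API}/projects/{PROJECT}/issues",
            headers={"PRIVATE-TOKEN": TOKEN},
            params={"per_page": 100, "page": page},
            timeout=30,
        )
        response.raise_for_status()
        issues = response.json()
        if not issues:  # past the last page
            break
        for issue in issues:
            for label in issue["labels"]:
                if label.startswith("master_broken::rca::"):
                    counts[label] += 1
        page += 1
    return counts


print(rca_label_counts().most_common())
```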
## What happens when we cannot diagnose the root cause?

I don't think we should simply give up when we're stuck. Not being able to diagnose a root cause can come from various factors:
- Lack of data (e.g. missing artifacts/cache)
- Lack of tooling (e.g. it is hard to find the specs that failed in a particular job after Knapsack has split the test suite, to track down the root-cause MR, ...)
- Lack of knowledge on particular application domain (e.g. low-level infrastructure issues, unfamiliar parts of GitLab, ...)
- Complexity of parts of the pipeline codebase
I would propose having a label for those cases as well: `master_broken::rca::unknown`. We could then review those incidents periodically to understand what we could improve (e.g. tooling, documentation, asking specialists from other teams to help us diagnose, refactoring to simplify the codebase). A minimal query for that periodic review is sketched below.
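Building the review agenda is just a label filter on the issues endpoint. A minimal sketch, again assuming a `GITLAB_TOKEN` with read access:

```python
import os

import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-org%2Fquality%2Fengineering-productivity%2Fmaster-broken-incidents"
TOKEN = os.environ["GITLAB_TOKEN"]  # assumed: a token with read access

# Incidents whose root cause we never managed to diagnose.
response = requests.get(
    f"{GITLAB_API}/projects/{PROJECT}/issues",
    headers={"PRIVATE-TOKEN": TOKEN},
    params={"labels": "master_broken::rca::unknown", "per_page": 100},
    timeout=30,
)
response.raise_for_status()
for issue in response.json():
    print(f"#{issue['iid']}: {issue['title']} ({issue['web_url']})")
```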