Proposal: When to enable predictive tests without full pipelines afterwards
Context
Docs on predictive pipelines: https://docs.gitlab.com/ee/development/pipelines/#predictive-test-jobs-before-a-merge-request-is-approved
- We are currently running predictive tests for both frontend and backend in MR pipelines by default. As a safety net, we also run full pipelines for both once the MR has received its first approval.
- master-brokentest-selection-gap issues only capture the test selection gaps that actually broke master. The majority of test selection gaps happen in MR pipelines and never reach master, because the full pipelines we run after the first MR approval catch them.
- CI/CD pipelines are not free. The fewer pipelines we run, and the fewer jobs we run in them, the better.
Goals
Short-term/mid-term goals:
- Have a predictive tests accuracy metric for MR pipelines as well, not just one based on broken-master (master-brokentest-selection-gap) incidents.
- Have a way to accurately label pipeline/job failures as caused by a flaky test, an infrastructure problem, or "something else" (a rough classification sketch follows after this list).
Long-term goals:
- Enable predictive tests without necessarily always having to run a full pipeline afterwards.
- We will probably always have to run full pipelines at some point; the key point is that we could do so less often if we were more confident in our predictive pipelines.
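To make the failure-labelling goal above more concrete, here is a minimal sketch of a rule-based classifier over a job's raw trace. Everything in it is an assumption for illustration: the infrastructure patterns, the `known_flaky_tests` set, and the `classify_failure` helper are hypothetical, not an existing GitLab API.

```python
import re

# Hypothetical patterns that usually indicate an infrastructure problem
# (runner disk space, image pulls, network timeouts, ...).
INFRASTRUCTURE_PATTERNS = [
    re.compile(r"no space left on device", re.IGNORECASE),
    re.compile(r"error pulling image", re.IGNORECASE),
    re.compile(r"connection (reset|refused|timed out)", re.IGNORECASE),
]


def classify_failure(trace, failed_tests, known_flaky_tests):
    """Label a failed job as 'infrastructure', 'flaky-test', or 'something-else'.

    trace: raw job log (str)
    failed_tests: test identifiers reported as failed (list of str)
    known_flaky_tests: test identifiers already labelled as flaky (set of str)
    """
    if any(pattern.search(trace) for pattern in INFRASTRUCTURE_PATTERNS):
        return "infrastructure"
    if failed_tests and all(test in known_flaky_tests for test in failed_tests):
        return "flaky-test"
    return "something-else"


# Example with made-up data:
print(classify_failure(
    trace="ERROR: Job failed: connection timed out",
    failed_tests=[],
    known_flaky_tests={"spec/models/user_spec.rb:42"},
))  # => "infrastructure"
```

In practice this would likely combine job trace heuristics with the flaky-test reports we already generate; the point here is only the shape of the labelling step.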
The proposal
Run only predictive pipelines in MRs if we have the following in place:
- We know how to accurately classify whether a pipeline/job failed due to an infrastructure problem, a flaky test, or "something else" (an attempt was made in master-broken-incidents$2507197, but it's not accurate enough).
- From there, we have cleaned up the tables in https://app.periscopedata.com/app/gitlab/1116767/Test-Intelligence-Accuracy (i.e. the Predictive to Full transition failures tables) so that they only contain predictive/full pipelines that failed because of a non-flaky, non-infrastructure issue.
- We have created two separate metrics: MR pipelines predictive tests accuracy and master pipelines predictive tests accuracy. The first would be based solely on MR pipelines that failed due to non-flaky/non-infrastructure issues, and the second would be based on the number of distinct master-brokentest-selection-gap issues, as we do today (a rough sketch of both computations follows after this list).
- Both predictive tests accuracy metrics have consistently stayed above a high threshold (e.g. 99.5%) over a few months.
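For illustration, here is a minimal sketch of how the two metrics could be computed, assuming the cleaned-up data is available as per-MR records of a predictive pipeline followed by a full pipeline. The `Transition` record, its field names, and the exact master formula are assumptions, not the actual Sisense/Periscope schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transition:
    """Hypothetical record: one MR predictive pipeline followed by a full pipeline."""
    predictive_passed: bool
    full_passed: bool
    # "infrastructure", "flaky-test", "something-else", or None if the full pipeline passed.
    full_failure_category: Optional[str]


def mr_predictive_accuracy(transitions):
    """Share of transitions without a test selection gap, i.e. without the case where
    the predictive pipeline passed but the full pipeline failed for a non-flaky,
    non-infrastructure reason."""
    if not transitions:
        return 1.0
    gaps = sum(
        1
        for t in transitions
        if t.predictive_passed
        and not t.full_passed
        and t.full_failure_category == "something-else"
    )
    return 1.0 - gaps / len(transitions)


def master_predictive_accuracy(master_pipeline_count, distinct_selection_gap_incidents):
    """Assumed formula: 1 - (distinct test-selection-gap incidents / master pipelines)."""
    if master_pipeline_count == 0:
        return 1.0
    return 1.0 - distinct_selection_gap_incidents / master_pipeline_count


# The proposal would require both metrics to stay above e.g. 99.5% for a few months:
THRESHOLD = 0.995
```

The design choice in this sketch is that flaky-test and infrastructure failures never count against the predictive selection, which is exactly what the cleanup step above is meant to guarantee.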