Find (and possibly skip) flaky tests in MR pipelines
Context
https://gitlab-org.gitlab.io/gitlab/rspec/flaky/report-suite.json is generated on master scheduled pipelines only.
We use this data in our flaky test dashboard (internal link).
We do not have flaky test data for MR pipelines.
Goal
- Have an SSOT (single source of truth) for flaky tests, whether in MR pipelines or in master pipelines.
- (Later on) Skip flaky tests in merge requests.
Motivations
- Predictive tests: To enable predictive pipelines without running a full pipeline afterwards, we need to know whether a pipeline failed due to a flaky test. We cannot do this unless we have a reliable SSOT for flaky tests (see proposal).
- Pipeline duration: Around 16% of jobs retry failed specs in a new process because of flaky specs (source). One flaky test causes around 6 minutes of delay in an entire pipeline (empirical data for now).
Related work & docs
Related work
- Detect and keep track of flaky specs (6 years ago)
- Automatically exclude flaky tests from RSpec jobs (2 years ago)
- ci: Skip flaky tests automatically and allow to opt-out (1 year ago)
- ci: Don't skip flaky tests automatically (8 months ago)
Docs
- https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html#automatic-retries-and-flaky-tests-detection
- Flaky tests dashboard (internal link)
- https://about.gitlab.com/handbook/engineering/quality/engineering-productivity/flaky-tests/
Proposed technical blueprint
- (Find) When a test fails in an MR pipeline and then passes on retry, it is considered flaky. We can then create an issue for it (see gitlab-org/ruby/gems/gitlab_quality-test_tooling!77 (merged)).
- (Skip) In scheduled pipelines, build a list of flaky tests to skip from the list of flaky test issues.
- (Skip) In MR pipelines, download this list and skip the flaky tests it contains (see the sketch below).
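To make the two (Skip) steps concrete, here is a minimal sketch of the MR-pipeline side, assuming the scheduled pipeline publishes the skip list as a JSON array of test hashes. The file location, environment variables, and hashing details are illustrative assumptions, not the existing tooling:

```ruby
# spec/support/skip_flaky_tests.rb
#
# Minimal sketch, not the actual implementation: the file path, environment
# variables, and hashing scheme are assumptions for illustration.
require 'digest'
require 'json'
require 'set'

RSpec.configure do |config|
  # The MR pipeline would download this list (built by the scheduled pipeline
  # from the flaky test issues) before running the suite.
  flaky_list_path = ENV.fetch('FLAKY_TESTS_FILE', 'tmp/flaky_tests.json')

  flaky_test_hashes =
    if ENV['SKIP_FLAKY_TESTS'] == 'true' && File.exist?(flaky_list_path)
      JSON.parse(File.read(flaky_list_path)).to_set
    else
      Set.new
    end

  config.before do |example|
    # Identify the test by a hash of its spec file and full description,
    # so the reference survives line-number changes (see the Q&A below).
    test_hash = Digest::SHA256.hexdigest(
      "#{example.metadata[:file_path]}-#{example.full_description}"
    )

    skip "Known flaky test (#{test_hash})" if flaky_test_hashes.include?(test_hash)
  end
end
```

Gating the behaviour behind an environment variable keeps the opt-out simple, in the spirit of the earlier "allow to opt-out" work listed above.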
Q&A
Should we automatically skip flaky tests in MRs?
TL;DR: Once we have a working SSOT for flaky tests, we should try again with caution (e.g. start with the top 20/50/100 flakiest tests, limited to spec classes that don't provide broad coverage).
Longer take:
A few thoughts first.
Skipping is essentially temporary quarantining.
Automatically skipping ALL the flaky tests we find is like a massive quarantining campaign.
Just like quarantining a test, skipping tests reduces test coverage. This hit us in the past (example), and we decided to stop skipping flaky tests to keep master clean.
I think we should apply the same caution with skipping as for quarantining:
- Consider skipping only a few tests at a time. Below are a few possible approaches:
  - Start with the top 20/50/100 flakiest tests.
  - Start with test classes that provide "smaller coverage" (e.g. don't skip feature specs, as they provide broad coverage - this is what happened in gitlab-org/gitlab#390448 (comment 1267230080)).
Thinking more about gitlab-org/gitlab#390448 (comment 1267230080), we stopped skipping tests because we had a broken master incident. I believe it was the right call to stop the all-or-nothing approach we had for skipping tests.
However, if we skip flaky tests "cautiously" as proposed above, I think we could go ahead with flaky test skipping and address incidents one at a time.
Also, we would need a working process for fixing those flaky tests (I think it would be easier to articulate once we have an SSOT for flaky tests with issues and a proper weight - coming in gitlab-org/ruby/gems/gitlab_quality-test_tooling!77 (merged)).
How can we refer to a test accurately? Do we use line number, RSpec IDs, or something else?
Something else. We rely on a test hash, which is a hash of the spec file and the test name (see pros/cons).
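For illustration, a test hash along those lines could be computed as follows; the exact inputs and digest used by gitlab_quality-test_tooling may differ:

```ruby
require 'digest'

# Illustrative helper: identify a test by its spec file and full test name,
# so the reference survives line-number changes within the file.
def test_hash(file_path, full_description)
  Digest::SHA256.hexdigest("#{file_path}-#{full_description}")
end

test_hash(
  'spec/models/user_spec.rb',
  'User#admin? returns true for admin users'
)
# => a 64-character hex string that only changes if the file path or test name changes
```

The trade-off is that renaming a test or moving it to another file produces a new hash, at which point a new issue/"time series" would be started (as noted below).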
Why create issues in GitLab? Why not have them in another data source?
In my view, for a few reasons:
- If we want to collect data from MR pipelines, we need a global storage to push the data to. Artifacts don't work out of the box, for example; S3-like storage would work, but would need a bit more maintenance.
- Dogfooding
- We would eventually want to create flaky test issues afterwards anyway (see gitlab-org/gitlab#398692 (closed) for proof of this)
- Issues have great features we could use:
- Issues give us timestamps for free when an action was taken (e.g. when the weight was incremented, when the issue was closed/reopened). This is effectively a time series that we could use for the entire lifecycle of a test (until it gets renamed or moved to another file, at which point we would create a new "time series").
- Issues can be closed/reopened, which indirectly gives an "active/inactive" status for a given flaky test.
- Issues have weights, which would allow us to easily count the number of times a flaky test was spotted (see the sketch after this list).
- Issues have labels (e.g. found:in MR, found:master, ...)
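As a rough sketch of how these features could be combined, the snippet below looks up the issue for a given flaky test and records one more sighting: it bumps the weight, reopens the issue if it was closed, and adds a found:* label through the GitLab REST API. The project ID, label names, and the convention of putting the test hash in the issue title are assumptions for illustration; the real reporting logic would live in gitlab_quality-test_tooling.

```ruby
require 'json'
require 'net/http'
require 'uri'

GITLAB_API = 'https://gitlab.com/api/v4'
PROJECT_ID = ENV.fetch('FLAKY_TESTS_PROJECT_ID') # hypothetical reporting project
TOKEN      = ENV.fetch('GITLAB_API_TOKEN')

def api_request(request)
  request['PRIVATE-TOKEN'] = TOKEN
  Net::HTTP.start(request.uri.host, request.uri.port, use_ssl: true) do |http|
    JSON.parse(http.request(request).body)
  end
end

# Find the issue tracking this flaky test, assuming the test hash is part of the title.
def find_flaky_test_issue(test_hash)
  uri = URI("#{GITLAB_API}/projects/#{PROJECT_ID}/issues?labels=flaky-test&search=#{test_hash}")
  api_request(Net::HTTP::Get.new(uri)).first
end

# Record one more sighting of the flaky test on an existing issue.
def record_flaky_occurrence(issue, found_in_mr:)
  uri = URI("#{GITLAB_API}/projects/#{PROJECT_ID}/issues/#{issue['iid']}")
  request = Net::HTTP::Put.new(uri)

  params = {
    'weight' => (issue['weight'].to_i + 1).to_s, # count of sightings
    'add_labels' => found_in_mr ? 'found:in MR' : 'found:master'
  }
  params['state_event'] = 'reopen' if issue['state'] == 'closed'

  request.set_form_data(params)
  api_request(request)
end

# Example usage in the reporting step of a pipeline:
#   issue = find_flaky_test_issue(test_hash)
#   record_flaky_occurrence(issue, found_in_mr: true) if issue
```

The issue's creation, update, and close/reopen timestamps then give us the "time series" mentioned above without any extra storage.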