Find (and possibly skip) flaky tests in MR pipelines
Context
https://gitlab-org.gitlab.io/gitlab/rspec/flaky/report-suite.json is generated on master scheduled pipelines only.
We use this data in our flaky test dashboard (internal link).
We do not have flaky test data for MR pipelines.
Goal
- Have an SSOT (single source of truth) for flaky tests, whether in MR pipelines or in master pipelines.
- (Later on) Skip flaky tests in merge requests.
Motivations
- Predictive tests: To enable predictive pipelines without running a full pipeline afterwards, we need to know whether a pipeline failed due to a flaky test. We cannot do this unless we have a reliable SSOT for flaky tests (see proposal).
- Pipeline duration: Around 16% of jobs retry failed specs in a new process because of flaky specs (source). One flaky test causes around 6 minutes of delay in an entire pipeline (empirical data for now).
Related work & docs
Related work
- Detect and keep track of flaky specs (6 years ago)
- Automatically exclude flaky tests from RSpec jobs (2 years ago)
- ci: Skip flaky tests automatically and allow to opt-out (1 year ago)
- ci: Don't skip flaky tests automatically (8 months ago)
Docs
- https://docs.gitlab.com/ee/development/testing_guide/flaky_tests.html#automatic-retries-and-flaky-tests-detection
- Flaky tests dashboard (internal link)
- https://about.gitlab.com/handbook/engineering/quality/engineering-productivity/flaky-tests/
Proposed technical blueprint
- (Find) When a test fails in an MR pipeline and then passes on retry, it is considered flaky. We can then create an issue for it (see gitlab-org/ruby/gems/gitlab_quality-test_tooling!77 (merged)).
- (Skip) In scheduled pipelines, build a list of flaky tests to skip from the list of flaky test issues.
- (Skip) In MR pipelines, download this list and skip the flaky tests it contains (see the sketch below).
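To make the two (Skip) steps concrete, here is a minimal sketch of the MR-pipeline side, assuming the scheduled pipeline publishes the skip list as a JSON array of test hashes. The file location, environment variables, and hashing details are illustrative assumptions, not the existing tooling:

```ruby
# spec/support/skip_flaky_tests.rb
#
# Minimal sketch, not the actual implementation: the file path, environment
# variables, and hashing scheme are assumptions for illustration.
require 'digest'
require 'json'
require 'set'

RSpec.configure do |config|
  # The MR pipeline would download this list (built by the scheduled pipeline
  # from the flaky test issues) before running the suite.
  flaky_list_path = ENV.fetch('FLAKY_TESTS_FILE', 'tmp/flaky_tests.json')

  flaky_test_hashes =
    if ENV['SKIP_FLAKY_TESTS'] == 'true' && File.exist?(flaky_list_path)
      JSON.parse(File.read(flaky_list_path)).to_set
    else
      Set.new
    end

  config.before do |example|
    # Identify the test by a hash of its spec file and full description,
    # so the reference survives line-number changes (see the Q&A below).
    test_hash = Digest::SHA256.hexdigest(
      "#{example.metadata[:file_path]}-#{example.full_description}"
    )

    skip "Known flaky test (#{test_hash})" if flaky_test_hashes.include?(test_hash)
  end
end
```

Gating the behaviour behind an environment variable keeps the opt-out simple, in the spirit of the earlier "allow to opt-out" work listed above.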
Q&A
Should we automatically skip flaky tests in MRs?
TL;DR: Once we have a working SSOT for flaky tests, we should try again with caution (e.g. start with the top 20/50/100 flakiest tests, limited to spec classes that don't provide broad coverage).
Longer take:
A few thoughts first.
Skipping is essentially temporary quarantining.
Automatically skipping ALL the flaky tests we find is like a massive quarantining campaign.
Just like quarantining a test, skipping tests reduces test coverage. This hit us in the past (example), and we decided to stop skipping flaky tests to keep master clean.
I think we should apply the same caution with skipping as for quarantining:
- Consider skipping only a few tests at a time. Below are a few possible approaches:
  - Start with the top 20/50/100 flakiest tests.
  - Start with test classes that provide "smaller coverage" (e.g. don't skip feature specs, as they provide broad coverage - this is what happened in gitlab-org/gitlab#390448 (comment 1267230080)).
Thinking more about gitlab-org/gitlab#390448 (comment 1267230080), we stopped skipping tests because we had a broken master incident. I believe it was the right call to stop the all-or-nothing approach we had for skipping tests.
However, if we skip flaky tests "cautiously" as proposed above, I think we could go ahead with flaky test skipping and address incidents one at a time.
Also, we would need a working process for fixing those flaky tests (I think it would be easier to articulate once we have an SSOT for flaky tests with issues and a proper weight - coming in gitlab-org/ruby/gems/gitlab_quality-test_tooling!77 (merged)).
How can we refer to a test accurately? Do we use line number, RSpec IDs, or something else?
Something else. We rely on a test hash, which is a hash of the spec file and the test name (see pros/cons).
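For illustration, a test hash along those lines could be computed as follows; the exact inputs and digest used by gitlab_quality-test_tooling may differ:

```ruby
require 'digest'

# Illustrative helper: identify a test by its spec file and full test name,
# so the reference survives line-number changes within the file.
def test_hash(file_path, full_description)
  Digest::SHA256.hexdigest("#{file_path}-#{full_description}")
end

test_hash(
  'spec/models/user_spec.rb',
  'User#admin? returns true for admin users'
)
# => a 64-character hex string that only changes if the file path or test name changes
```

The trade-off is that renaming a test or moving it to another file produces a new hash, at which point a new issue/"time series" would be started (as noted below).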
Why create issues in GitLab? Why not have them in another data source?
In my view, for a few reasons:
- If we want to collect data from MR pipelines, we need a global storage to push the data to. Artifacts don't work out of the box, for example; S3-like storage would work, but would need a bit more maintenance.
- Dogfooding
- We would eventually want to create flaky test issues afterwards anyway (see gitlab-org/gitlab#398692 (closed) for proof of this)
- Issues have great features we could use:
- Issues give us timestamps for free when an action was taken (e.g. when the weight was incremented, when the issue was closed/reopened). This is effectively a time series that we could use for the entire lifecycle of a test (until it gets renamed or moved to another file, at which point we would create a new "time series").
- Issues can be closed/reopened, which indirectly gives an "active/inactive" status for a given flaky test.
- Issues have weights, which would allow us to easily count the number of times a flaky test was spotted (see the sketch after this list).
- Issues have labels (e.g. found:in MR, found:master, ...)
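As a rough sketch of how these features could be combined, the snippet below looks up the issue for a given flaky test and records one more sighting: it bumps the weight, reopens the issue if it was closed, and adds a found:* label through the GitLab REST API. The project ID, label names, and the convention of putting the test hash in the issue title are assumptions for illustration; the real reporting logic would live in gitlab_quality-test_tooling.

```ruby
require 'json'
require 'net/http'
require 'uri'

GITLAB_API = 'https://gitlab.com/api/v4'
PROJECT_ID = ENV.fetch('FLAKY_TESTS_PROJECT_ID') # hypothetical reporting project
TOKEN      = ENV.fetch('GITLAB_API_TOKEN')

def api_request(request)
  request['PRIVATE-TOKEN'] = TOKEN
  Net::HTTP.start(request.uri.host, request.uri.port, use_ssl: true) do |http|
    JSON.parse(http.request(request).body)
  end
end

# Find the issue tracking this flaky test, assuming the test hash is part of the title.
def find_flaky_test_issue(test_hash)
  uri = URI("#{GITLAB_API}/projects/#{PROJECT_ID}/issues?labels=flaky-test&search=#{test_hash}")
  api_request(Net::HTTP::Get.new(uri)).first
end

# Record one more sighting of the flaky test on an existing issue.
def record_flaky_occurrence(issue, found_in_mr:)
  uri = URI("#{GITLAB_API}/projects/#{PROJECT_ID}/issues/#{issue['iid']}")
  request = Net::HTTP::Put.new(uri)

  params = {
    'weight' => (issue['weight'].to_i + 1).to_s, # count of sightings
    'add_labels' => found_in_mr ? 'found:in MR' : 'found:master'
  }
  params['state_event'] = 'reopen' if issue['state'] == 'closed'

  request.set_form_data(params)
  api_request(request)
end

# Example usage in the reporting step of a pipeline:
#   issue = find_flaky_test_issue(test_hash)
#   record_flaky_occurrence(issue, found_in_mr: true) if issue
```

The issue's creation, update, and close/reopen timestamps then give us the "time series" mentioned above without any extra storage.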