Devise a way to reduce the problem of flaky tests in the future
We have a lot of feature specs, and these are quite fragile, especially in the presence of race conditions and similar timing problems.
This leads to flaky tests.
It might not be possible to avoid transient failures once the repository contains a large enough number of tests. Some companies write many more unit tests because these are much more reliable.
The more feature tests we have, the more transient failures we are going to have. Should we start thinking about what we can improve in the process? If we have a transient failure rate of 0.01% per example (which is really low!), with 15000 test examples we would still expect about 1.5 transient failures per pipeline, so most pipelines would end up under the red bar.
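As a rough check of the numbers above, the following sketch computes the expected number of transient failures and the probability that a pipeline contains at least one. It assumes every example fails independently with the same probability, which is a simplification not stated in this issue; the method name is hypothetical.

```ruby
# Probability that a pipeline contains at least one transient failure,
# ASSUMING each example fails independently with the same probability.
def red_pipeline_probability(example_count, failure_rate)
  1.0 - (1.0 - failure_rate)**example_count
end

failure_rate = 0.0001  # 0.01% expressed as a fraction
example_count = 15_000 # number of test examples, as quoted above

expected_failures = example_count * failure_rate
probability = red_pipeline_probability(example_count, failure_rate)

puts format('Expected transient failures per pipeline: %.2f', expected_failures)
puts format('Probability of a red pipeline: %.1f%%', probability * 100)
```

Under that independence assumption, roughly 78% of pipelines would contain at least one transient failure, even though each individual example is very reliable.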
I created this issue to encourage discussion about what we can do to still benefit from having CI pipelines even if we won't be able to avoid flaky tests.
The plan
Step 1 (10.0)
Developed https://gitlab.com/gitlab-org/gitlab-ce/tree/master/lib/rspec_flaky (could be extracted to a gem once stable/complete), which builds on top of rspec-retry to detect flaky specs: !13021 (merged)
- Detect flaky RSpec examples
- Retrieve the up-to-date list of currently tracked RSpec flaky examples (retrieve-tests-metadata job) from S3
- Warn when new RSpec flaky examples are found in a non-master branch (flaky-examples-check job): for now we only warn (i.e. the job is allowed to fail) because we need to build a list of all the currently flaky specs to be sure that newly detected RSpec flaky specs are truly introduced by a branch rather than just missing from the flaky specs report file
- Update the list of currently tracked RSpec flaky examples on master (update-tests-metadata job) and upload it to S3, e.g. https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/31688822/artifacts/file/rspec_flaky/gitlab-ce/report-master.json
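The comparison performed by the flaky-examples-check job could look roughly like the sketch below. This is not the actual rspec_flaky code: the report layout (a JSON object keyed by example id), the example ids, and the method name new_flaky_examples are assumptions made for illustration.

```ruby
require 'json'

# Returns the examples reported as flaky on the branch that are NOT present
# in the tracked (master) report. Both arguments are JSON strings.
# ASSUMPTION: each report is a JSON object keyed by example id.
def new_flaky_examples(tracked_report_json, branch_report_json)
  tracked = JSON.parse(tracked_report_json)
  branch = JSON.parse(branch_report_json)
  branch.reject { |example_id, _details| tracked.key?(example_id) }
end

# Hypothetical data standing in for the report downloaded from S3
# and the report produced by the branch pipeline.
tracked_json = JSON.generate({
  'spec/features/login_spec.rb[1:1]' => { 'last_flaky_at' => '2017-09-01T10:00:00Z' }
})
branch_json = JSON.generate({
  'spec/features/login_spec.rb[1:1]' => { 'last_flaky_at' => '2017-09-20T08:30:00Z' },
  'spec/features/merge_request_spec.rb[1:2]' => { 'last_flaky_at' => '2017-09-20T08:31:00Z' }
})

unknown = new_flaky_examples(tracked_json, branch_json)

if unknown.empty?
  puts 'No new flaky examples detected.'
else
  # Only a warning for now, mirroring the "allowed to fail" behaviour above.
  warn 'New flaky examples detected:'
  unknown.each_key { |example_id| warn "  #{example_id}" }
end
```

An example already present in the tracked report is not flagged, which is why the tracked list has to be reasonably complete before the job can be made blocking.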
Step 2 (%10.7)
- Clean outdated flaky specs from the flaky specs report: #37721 (closed)
Step 3 (%10.8)
- Forbid the flaky-examples-check job to fail: #37720 (moved)
We need to clean up examples that are not flaky anymore from the flaky specs report file. The easiest way to do that is to clean them up based on their last_flaky_at attribute.
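A minimal sketch of such a cleanup is shown below. Only the last_flaky_at attribute comes from this issue; the report layout, the 14-day threshold, the example ids, and the method name prune_outdated are hypothetical.

```ruby
require 'time'

SECONDS_PER_DAY = 24 * 60 * 60

# Keeps only the examples whose last_flaky_at is within max_age_days of `now`.
# ASSUMPTION: the report is a Hash keyed by example id, and each entry stores
# last_flaky_at as an ISO 8601 timestamp string.
def prune_outdated(report, max_age_days:, now: Time.now)
  cutoff = now - (max_age_days * SECONDS_PER_DAY)
  report.select do |_example_id, details|
    Time.parse(details.fetch('last_flaky_at')) >= cutoff
  end
end

# Hypothetical report contents.
report = {
  'spec/features/login_spec.rb[1:1]' => { 'last_flaky_at' => '2017-09-25T10:00:00Z' },
  'spec/features/issues_spec.rb[1:3]' => { 'last_flaky_at' => '2017-06-01T10:00:00Z' }
}

# The 14-day threshold is an arbitrary value chosen for illustration.
pruned = prune_outdated(report, max_age_days: 14, now: Time.utc(2017, 10, 1))
pruned.each_key { |example_id| puts "Still tracked: #{example_id}" }
```

Passing `now` explicitly keeps the cleanup deterministic, which makes it easy to test without depending on the current date.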