Shorten CI pipelines that would fail due to rspec failure
What is the GitLab engineering productivity problem to solve?
We want to give MR authors quick feedback if any of the following rspec tests fail:
- tests written or changed as part of the MR
- tests that test the files changed in the MR
Problem identification checklist
- The root cause of the problem is identified.
- The surface of the problem is as small as possible.
What are the potential solutions?
We can use the test-file-finder script as a precursor to dynamic analysis test mapping. At this point, we only reduce the number of tests that run when a failure occurs; a pipeline that runs to completion still executes the full test suite. This is similar in principle to the Verify/FailFast template, but tailored to the GitLab project.
- Add a new job to run the identified tests (similar to rspec foss impact, but also including ee) as a stage preceding the other rspec jobs.
- Store the test files that were executed as an artifact, so that subsequent rspec jobs can exclude them from their runs.
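The first step, finding the tests that relate to the changed files, could look roughly like the sketch below. It is not the test-file-finder implementation: the path conventions in `spec_files_for` are assumptions made for illustration, and both method names are hypothetical.

```ruby
# Map one changed file to the spec files that are likely to cover it.
# The path conventions here are assumed, not taken from test-file-finder.
def spec_files_for(changed_file)
  case changed_file
  when %r{\A(ee/)?spec/.+_spec\.rb\z}
    # A spec written or changed in the MR is run as-is.
    [changed_file]
  when %r{\A(ee/)?app/(.+)\.rb\z}
    ["#{Regexp.last_match(1)}spec/#{Regexp.last_match(2)}_spec.rb"]
  when %r{\A(ee/)?lib/(.+)\.rb\z}
    ["#{Regexp.last_match(1)}spec/lib/#{Regexp.last_match(2)}_spec.rb"]
  else
    # No known mapping: nothing to add to the impact job.
    []
  end
end

# Collect the de-duplicated, sorted list of specs for a whole MR.
def impacted_specs(changed_files)
  changed_files.flat_map { |file| spec_files_for(file) }.uniq.sort
end
```

A real job would also need to drop mapped paths that do not exist on disk before handing the list to rspec.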
A simplified pipeline would be something like the following:
graph LR
subgraph "rspec minimal<br />run tests for changed files";
A["rspec impact"];
B["rspec foss impact"];
end
subgraph "rspec full suite<br />run all remaining tests";
D["rspec migration"];
E["rspec unit"];
F["rspec integration"];
G["rspec system"];
end
A --> H
H["artifact: test files executed"]
H --exclude from--> D
H --exclude from--> E
H --exclude from--> F
H --exclude from--> G
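The "exclude from" edges in the graph amount to a set difference between the specs a full-suite job would normally run and the specs the impact job already ran. A minimal sketch, assuming the artifact is a plain-text file with one spec path per line (the artifact format and both method names are hypothetical):

```ruby
require 'set'

# Parse the artifact written by the impact job into a set of spec paths.
# Blank lines and surrounding whitespace are ignored.
def parse_executed(artifact_text)
  artifact_text.each_line.map(&:strip).reject(&:empty?).to_set
end

# Return the specs a full-suite job still has to run.
def remaining_specs(all_specs, executed)
  all_specs.reject { |path| executed.include?(path) }
end
```

For example, if the impact job recorded `spec/models/user_spec.rb`, a later `rspec unit` job would pass only the other unit spec paths to rspec.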
Update: After experimenting with @godfat-gitlab's suggestion, we could instead run the impact tests in parallel with the other tests and cancel the pipeline when the impact tests fail. This still saves the cost of long-running jobs in failing pipelines, while avoiding the extra pipeline duration that a preceding stage would add.
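One way the impact job could cancel the pipeline on failure is through the GitLab pipelines API. The sketch below is only an illustration of that idea, not what the GitLab project does: `CANCEL_PIPELINE_TOKEN` is a hypothetical CI/CD variable holding a token that is allowed to cancel pipelines, and the method names are made up. `CI_API_V4_URL`, `CI_PROJECT_ID` and `CI_PIPELINE_ID` are predefined CI/CD variables.

```ruby
require 'net/http'
require 'uri'

# Build the "cancel pipeline" request from CI environment variables.
# CANCEL_PIPELINE_TOKEN is a hypothetical variable, assumed to be set
# in the project's CI/CD settings.
def build_cancel_request(env)
  uri = URI("#{env.fetch('CI_API_V4_URL')}/projects/#{env.fetch('CI_PROJECT_ID')}" \
            "/pipelines/#{env.fetch('CI_PIPELINE_ID')}/cancel")
  request = Net::HTTP::Post.new(uri)
  request['PRIVATE-TOKEN'] = env.fetch('CANCEL_PIPELINE_TOKEN')
  [uri, request]
end

# Send the request. Intended to be called only after the impact specs fail.
def cancel_pipeline(env = ENV)
  uri, request = build_cancel_request(env)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end
```

Keeping request construction separate from sending makes the URL and headers easy to check without touching the network.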
Expected result
We can expect the following impact:
- Fewer jobs executed in failing pipelines, as there are fewer rspec jobs to run. This translates into a lower average CI cost per pipeline.
- There may be a tradeoff in pipeline duration from adding a new job stage, but there would also be fewer rspec tests to run in the full suite.
Edge cases
- Some MRs may have too many changes, causing a job timeout, similar to the case for rspec foss impact (#220883 (closed)). In this case, we don't need to fail the job; we can pass through to the full suite to run all the tests.
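The pass-through behaviour could be approximated by capping how many specs the impact job is willing to run. In the sketch below, the limit value and the method name are assumptions for illustration; the issue itself only says the job should not fail in this case.

```ruby
# Hypothetical cap on how many spec files the impact job will run.
MAX_IMPACT_SPECS = 500

# Return the specs the impact job should run. When the MR touches too
# much, return an empty list so the job does nothing, records no
# executed files, and the full suite runs everything.
def select_impact_specs(impacted_specs, limit: MAX_IMPACT_SPECS)
  return [] if impacted_specs.size > limit

  impacted_specs
end
```

Because an empty selection leaves nothing to exclude, the later rspec jobs fall back to their normal, complete runs.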
Verify that the solution has improved the situation
- The solution improved the situation.
  - If yes, check this box and close the issue. Well done! 🎉
  - Otherwise, create a new "Productivity Improvement" issue. You can re-use the description from this issue, but obviously another solution should be chosen this time.
Open questions:
- What is the tradeoff in the pipeline duration?
- Do we need to split the impact test into levels (unit, migration, integration, system) as well?
- Do we need parallel jobs on the impact test?
- Would this make the pipeline overly complex? How can we keep it simple?
- How would this complicate metadata such as coverage, flaky tests, and rspec metadata?
- How would this interoperate with master pipelines?