Shorten CI pipelines that would fail due to rspec failure

What is the GitLab engineering productivity problem to solve?

We want to give MR authors quick feedback if any of the following rspec tests fail:

tests written or changed as part of the MR
tests that test the files changed in the MR

Problem identification checklist

The root cause of the problem is identified.
The surface of the problem is as small as possible.

What are the potential solutions?

We can use the test-file-finder script as a precursor to dynamic-analysis test mapping. At this point, we only reduce the number of tests run when a failure occurs; a pipeline that completes still runs the full test suite. This is similar in principle to the Verify/FailFast template, but tailored to the GitLab project.

Add a new job that runs the identified tests (similar to rspec foss impact, but including ee) in a stage preceding the other rspec jobs.
Store the list of executed test files as an artifact, so that subsequent rspec jobs can exclude them from their runs.
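
As a rough illustration of the two steps above, here is a minimal Python sketch. The mapping rules, function names, and file paths are hypothetical and chosen only for illustration; they are not the actual test-file-finder configuration or API. The sketch maps changed files to candidate spec files, then filters those files out of the list a later job would run.

```python
import re

# Hypothetical mapping rules: (pattern for a changed file, template for its spec file).
# The real test-file-finder mapping may differ; these are assumptions for illustration.
MAPPINGS = [
    (re.compile(r"^(ee/|)app/(.+)\.rb$"), r"\1spec/\2_spec.rb"),
    (re.compile(r"^(ee/|)lib/(.+)\.rb$"), r"\1spec/lib/\2_spec.rb"),
    # A spec file that was itself written or changed in the MR maps to itself.
    (re.compile(r"^((?:ee/)?spec/.+_spec\.rb)$"), r"\1"),
]


def impacted_specs(changed_files):
    """Return the sorted spec files matched by the first applicable mapping rule."""
    specs = set()
    for path in changed_files:
        for pattern, template in MAPPINGS:
            match = pattern.match(path)
            if match:
                specs.add(match.expand(template))
                break
    return sorted(specs)


def remaining_specs(all_specs, executed_specs):
    """Return the specs a later job still has to run, preserving the original order."""
    executed = set(executed_specs)
    return [spec for spec in all_specs if spec not in executed]
```

In this sketch, the output of `impacted_specs` would be what the new job runs and stores as an artifact, and `remaining_specs` would be applied by each subsequent rspec job to its own file list.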

A simplified pipeline would be something like the following:

graph LR
    subgraph "rspec minimal<br />run tests for changed files";
        A["rspec impact"];
        B["rspec foss impact"];
    end

    subgraph "rspec full suite<br />run all remaining tests";
        D["rspec migration"];
        E["rspec unit"];
        F["rspec integration"];
        G["rspec system"];
    end

    A --> H
    H["artifact: test files executed"]


    H --exclude from--> D
    H --exclude from--> E
    H --exclude from--> F
    H --exclude from--> G

Update: After experimenting with @godfat-gitlab's suggestion, we could run the impact test in parallel with the other tests, and cancel the pipeline when the impact test fails. This strikes a balance between saving the cost of long-running jobs and avoiding additional pipeline duration.
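
One possible way to cancel the pipeline from a failing impact job is to call the GitLab pipelines API (`POST /projects/:id/pipelines/:pipeline_id/cancel`). The sketch below is an assumption about how this could be wired up, not the implementation used in the GitLab project. `CI_API_V4_URL`, `CI_PROJECT_ID`, and `CI_PIPELINE_ID` are predefined CI variables; `CANCEL_TOKEN` is a hypothetical variable holding a token that is allowed to cancel pipelines.

```python
import os
import urllib.request


def build_cancel_request(api_url, project_id, pipeline_id, token):
    """Build the POST request that asks GitLab to cancel one pipeline."""
    url = f"{api_url}/projects/{project_id}/pipelines/{pipeline_id}/cancel"
    return urllib.request.Request(url, method="POST", headers={"PRIVATE-TOKEN": token})


def cancel_current_pipeline(env=os.environ):
    """Cancel the pipeline the current job belongs to and return the HTTP status."""
    request = build_cancel_request(
        env["CI_API_V4_URL"],
        env["CI_PROJECT_ID"],
        env["CI_PIPELINE_ID"],
        env["CANCEL_TOKEN"],  # hypothetical variable name, not a predefined one
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

The impact job would call `cancel_current_pipeline()` only after its rspec run has failed, so that passing pipelines are unaffected.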

Expected result

We can expect the following impact:

A reduced number of jobs executed in failing pipelines, as there are fewer rspec jobs to run. This translates to a reduction in the average per-pipeline CI cost.
There may be a tradeoff in pipeline duration from adding a new job stage, but there would also be fewer rspec tests to run in the full suite.

Edge cases

Some MRs may have too many changes, causing a job timeout, similar to the case for rspec foss impact (#220883 (closed)). In this case, we don't need to fail the job, but pass through to the full suite to run all the tests.
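
A minimal sketch of this pass-through behaviour, assuming the impact tests are launched as a subprocess by a wrapper script. The function name and the timeout handling are illustrative only, not the job's actual configuration.

```python
import subprocess
import sys


def run_impact_tests(command, timeout_seconds):
    """Run the impact tests; treat a timeout as a pass so the full suite decides."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Too many changed files to finish in time: do not fail the job.
        print("impact tests timed out; deferring to the full suite", file=sys.stderr)
        return 0
    return completed.returncode
```

A real test failure still surfaces as a non-zero exit code; only the timeout is swallowed.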

Verify that the solution has improved the situation

The solution improved the situation.
- If yes, check this box and close the issue. Well done! 🎉
- Otherwise, create a new "Productivity Improvement" issue. You can re-use the description from this issue, but obviously another solution should be chosen this time.

Open questions:

What is the tradeoff in the pipeline duration?
Do we need to split the impact test into levels (unit, migration, integration, system) as well?
Do we need parallel jobs on the impact test?
Would this make the pipeline overly complex? How can we keep it simple?
How would this complicate metadata such as coverage, flaky tests, rspec metadata?
How would this be interoperable with master pipelines?

Edited Sep 02, 2020 by Albert Salim