Create a flaky tests SSOT outside of gitlab-org/gitlab

Problem

We know that flaky tests are one of our biggest problems, in terms of:

master stability
waste of productivity (engineers have to retry jobs manually)
slowing MTTP (Mean Time To Production) and MTTR (Mean Time To Recovery)
CI compute cost (each retried job adds cost)

One efficient way to mitigate this is to quarantine problematic tests. Quarantining a test currently means adding quarantine: metadata to the RSpec example.

The problem is that this requires modifying the main GitLab codebase, which triggers pipelines that can take up to 60 minutes to run.

In the context of a Production Incident Recovery, this can mean hours of delay, just to exclude one test from the test suite.

Proposal

See #204 (comment 1363963068).

Abandoned proposal


The proposal is to make the quarantining process much lighter by keeping the SSOT of quarantined tests in a separate place/project.

The idea is to:

Create test cases automatically in https://gitlab.com/gitlab-org/gitlab/-/quality/test_cases for failing tests, similar to what we started experimenting with in https://gitlab.com/gitlab-org/quality/engineering-productivity/flaky-tests-playground/-/issues/?sort=weight_desc&state=opened&first_page_size=100
When a test case is detected as flaky (manually for now), just add the quarantine label to it.
The associated webhook event would be received by triage-ops, which would trigger a pipeline in a separate project, e.g. gitlab-org/quality/engineering-productivity/flaky-tests
The gitlab-org/quality/engineering-productivity/flaky-tests pipeline would compute a JSON file (a bit similar to https://gitlab-org.gitlab.io/gitlab/rspec/flaky/report-suite.json) with the list of quarantined Test cases, and other metadata if the quarantine needs to be applied to specific environments/branches etc.
The JSON file would then be published via Pages so that various pipelines can fetch it reliably and without authentication (unlike using the API)
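
The report-building step above could look roughly like the following. This is a minimal sketch, not an actual implementation: the `TestCase` struct, the `quarantined_tests` key, the entry fields, and the `branch::<name>` label convention for scoping are all assumptions made for illustration.

```ruby
require "json"

# Hypothetical in-memory shape of a test case, e.g. as built from an API response.
TestCase = Struct.new(:title, :file, :labels, keyword_init: true)

# Build the JSON report from the test cases that carry the quarantine label.
def build_quarantine_report(test_cases, quarantine_label = "quarantine")
  quarantined = test_cases.select { |tc| tc.labels.include?(quarantine_label) }

  entries = quarantined.map do |tc|
    {
      "description" => tc.title,
      "file" => tc.file,
      # Assumed convention: a "branch::<name>" label restricts the quarantine
      # to that branch; no such label means it applies everywhere.
      "branches" => tc.labels.filter_map { |label| label[/\Abranch::(.+)\z/, 1] }
    }
  end

  JSON.pretty_generate("quarantined_tests" => entries)
end
```

The resulting string is what the pipeline would publish as a static file.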

flowchart LR
  A1 -.->|Webhook is sent to triage-ops.gitlab.com| A2
  subgraph gitlab [gitlab-org/gitlab]
    A1>Test case state change]
  end
  
  D2 -.->|Trigger pipeline| gitlab-org/quality/engineering-productivity/flaky-tests
  subgraph triage-ops [triage-ops.gitlab.com]
    A2[TestCaseUpdate handles the event]
    B2{`quarantine` label was added or removed?}
    D2["Trigger pipeline"]
    D21["Do nothing"]

    A2 --> B2
    B2 -->|yes| D2
    B2 -->|no| D21
  end

  subgraph gitlab-org/quality/engineering-productivity/flaky-tests [Flaky tests project]
  end
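
The decision node in the flowchart can be sketched as a small predicate. This assumes the webhook payload for a test case update lists label changes under `changes.labels.previous` and `changes.labels.current`, each an array of objects with a `title` field, as issue webhooks do; whether test case events have the same shape is an assumption. The method names are hypothetical.

```ruby
# Collect the label titles from one side ("previous" or "current") of the change.
def label_titles(payload, side)
  labels = payload.dig("changes", "labels", side) || []
  labels.map { |label| label["title"] }
end

# True when the label is present on exactly one side of the change,
# meaning this update either added or removed it.
def quarantine_label_toggled?(payload, label = "quarantine")
  label_titles(payload, "previous").include?(label) ^
    label_titles(payload, "current").include?(label)
end

# Mirrors the two outcomes in the flowchart.
def handle_test_case_update(payload)
  quarantine_label_toggled?(payload) ? :trigger_pipeline : :do_nothing
end
```

Comparing both sides, rather than only checking the current labels, avoids triggering a pipeline on unrelated edits to a test case that is already quarantined.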

Then in gitlab-org/gitlab pipelines, we'd download the JSON file, and skip running tests that are quarantined.

The nice thing is that we already have the code to automatically skip tests based on a report, since we used to do it based on https://gitlab-org.gitlab.io/gitlab/rspec/flaky/report-suite.json, so it would be a matter of allowing more tests to be quarantined manually through the Test cases tracker.
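
For illustration, the lookup a pipeline would perform against the downloaded report could look like the sketch below. It is not the existing skip code: the `quarantined_tests` key, the `file`/`description`/`branches` fields, and the method names are assumptions.

```ruby
require "json"

# Parse the downloaded report; a missing key is treated as "nothing quarantined".
def load_quarantine_entries(json)
  JSON.parse(json).fetch("quarantined_tests", [])
end

# A test is quarantined when an entry matches its file and description.
# An entry with an empty or missing "branches" list applies to every branch.
def quarantined?(entries, file:, description:, branch:)
  entries.any? do |entry|
    branches = entry.fetch("branches", [])
    entry["file"] == file &&
      entry["description"] == description &&
      (branches.empty? || branches.include?(branch))
  end
end
```

In an RSpec suite, a hook could call `quarantined?` for each example and `skip` it on a match. Treating a missing key as an empty list means a malformed report runs all tests rather than skipping any.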

Pros

The main advantage is that quarantining a test would be super fast: apply a label to a Test case, and wait for the Flaky tests JSON report to be updated in gitlab-org/quality/engineering-productivity/flaky-tests.
Using Pages, we're almost certain that the JSON file is always available
We would dogfood the Test Cases feature and improve it as a result
We would group unit/integration/system and E2E test cases under https://gitlab.com/gitlab-org/gitlab/-/quality/test_cases
We would also group quarantined tests data in a single JSON file
We could still use the confiner tool, but it would only add labels to test cases (which is simpler than trying to find a test in a file, add metadata to it and open an MR in the main project)

Cons & challenges

The SSOT would be farther away from the codebase, so we should make sure this data is easy for groups to find, so that they know which of their tests are quarantined at any moment
Since the quarantine state wouldn't be in the codebase anymore, that means a flaky test could be removed from the codebase in the latest master but would still need to be quarantined in the previous *-stable-ee branch
- The same would apply to a flaky test that would be un-quarantined in master but would still need to be quarantined in the previous *-stable-ee branch
- Either we'd need to make sure such fixes/removals are backported to stable branches, or keep flaky test issues around for at least current release + 3 months
What about file renaming or test description update?
- These would be challenging if the identification of a test is based on the filename / test description combination (and even if it's only based on the test description)
- In these cases, tests would be immediately un-quarantined unexpectedly
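
To make the renaming concern concrete, here is a small sketch of an identifier derived from the filename and test description. The hashing scheme and the `test_identifier` name are hypothetical, chosen only to show why such an identifier is fragile.

```ruby
require "digest/sha2"

# Hypothetical identifier: a short digest of the file path and description.
def test_identifier(file, description)
  Digest::SHA256.hexdigest("#{file}\n#{description}")[0, 12]
end

before = test_identifier("spec/models/user_spec.rb", "returns the full name")
after_rename = test_identifier("spec/models/users/user_spec.rb", "returns the full name")

# The same test now has a different identifier, so a quarantine entry keyed
# on the old one would silently stop matching.
puts before == after_rename
```

Any scheme keyed on these two values would need a way to carry the quarantine over when either of them changes.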

Edited Apr 27, 2023 by Rémy Coutable