Auto-retry jobs in merge requests when we detect flaky tests

Context

Merge request pipelines sometimes fail because of tests that are unrelated to the merge request's changes.

We now have test health issues for tests that failed a CI job on the master branch of this project (example), and we're able to map a failing test to its test health issue.

Goal

Automatically retry a merge request pipeline job when it fails because of a test that already has a test health issue on master.

Technical thoughts

- https://docs.gitlab.com/ee/ci/yaml/#retryexit_codes should be very helpful here: we could fail an RSpec job with a specific exit code when we detect a known flaky test, and the job would then be retried automatically, one or more times (to be discussed/defined).
- We could use another exit code and apply the same logic to infrastructure failures that we know a retry would help with.
  - Not all infrastructure issues are safe to retry. If there is an active incident with 500 errors, retrying a lot of jobs might make the problem worse.
- Use a project-level CI/CD variable to quickly enable or disable this new feature.
- There are a few issues in &8789 that we'll need to tackle before we can roll this out.
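
To make the idea above more concrete, here is a minimal sketch of the exit-code decision, written in Python for illustration only (the real job runs RSpec). Everything named here is an assumption and not part of this issue: the exit code value 112, the variable name AUTO_RETRY_KNOWN_FLAKY_TESTS, and the rule that a job is only retried when every failing test already has a test health issue.

```python
import os

# Hypothetical values: the issue does not define the exit codes or the variable name.
EXIT_CODE_KNOWN_FLAKY_TEST = 112  # would be listed under retry:exit_codes in the CI config
EXIT_CODE_GENERIC_FAILURE = 1
FEATURE_VARIABLE = "AUTO_RETRY_KNOWN_FLAKY_TESTS"


def auto_retry_enabled(env=os.environ):
    """Return True when the (hypothetical) project CI/CD variable is set to 'true'."""
    return env.get(FEATURE_VARIABLE, "false").lower() == "true"


def choose_exit_code(failed_tests, known_flaky_tests, enabled):
    """Pick the job exit code from the failed tests.

    failed_tests: test file paths that failed in this job.
    known_flaky_tests: test file paths that have a test health issue on master.
    enabled: whether the auto-retry feature is switched on.
    """
    if not failed_tests:
        return 0
    # Assumption: only signal "retry" when *all* failures are known flaky tests,
    # so a genuine failure introduced by the merge request is never hidden.
    if enabled and set(failed_tests) <= set(known_flaky_tests):
        return EXIT_CODE_KNOWN_FLAKY_TEST
    return EXIT_CODE_GENERIC_FAILURE


if __name__ == "__main__":
    known = {"spec/models/user_spec.rb"}
    failed = ["spec/models/user_spec.rb"]
    print(choose_exit_code(failed, known, auto_retry_enabled()))
```

With this shape, the job's retry configuration would only list the dedicated exit code, so regular failures are not retried, and unsetting the project variable turns the behavior off without a code change.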

Edited Oct 01, 2024 by David Dieulivol