Expand the suite of blocking E2E tests with automation
As per the latest Reliable Specs Report gitlab-org/gitlab#418701 (closed), we have 273 specs that are candidates for promotion. A reliable test is defined as an end-to-end test that passes consistently in all pipelines for at least 14 days.
We should make an effort to review and promote these tests to the reliable bucket; this is something we have been less active about recently than we perhaps were in the past.
Motivation
In FY25Q1 the TTI team has the objective to "Enhance reliability and efficiency in deployment pipelines by reducing end-to-end flaky test blockages to < 10%", with a Key Result to "Develop automated de-quarantine processes to increase the reliable test suite pool".
For Q1, the remaining work to automate de-quarantining of reliable tests is the logical follow-on to automating the promotion of reliable E2E tests to the blocking bucket and the quarantining of reliable tests that have been failing. These changes support our objective by adding more blocking tests, widening the safety net that stops bugs and other test issues from reaching the deployment pipeline. This in turn will help reduce the number of hours deployments are blocked by flaky tests.
Benefits
- Typically only tests marked as `smoke` or `reliable` are run as part of the deployment pipelines. This is our last chance to catch faults in a new build before it is deployed to production. Maximising the coverage of these tests while balancing pipeline duration is important: it ensures we have confidence in our latest builds while still allowing us to deploy rapidly.
- We want to find ways to block MRs when pipelines fail prior to merging to `master` (#1804). However, the impact of flaky tests on our pipelines has forced us to mark the jobs as 'allowed to fail', which has led to cases where valid bugs were detected by the E2E suite but overlooked during the MR. As a step back towards not allowing jobs to fail, we could first run the reliable test suite and block the MR if it fails. This can be achieved today as-is, but expanding the coverage of these jobs by promoting the specs above could go a long way towards maximising coverage during MRs while we work on hardening the remainder of the suite.
Problems to Consider
- Promoting large batches of tests to `reliable` in a single large sweep risks pipeline stability. If our heuristics for measuring 'reliability' are not accurate enough, flaky failing reliable tests will:
  - block pipelines
  - reduce confidence in our E2E test suite
  - cause friction for developers, who will be slowed down reviewing flaky test failures
- Issues such as Explore Options for Running Full QA Suite on St... (#1898 - closed) highlight that this report may include specs that simply haven't run against live environments (e.g. staging), so we may not currently have good visibility into which tests are, or are likely to be, candidates for promotion in those environments. Do we need to re-evaluate how we define `reliable` tests, especially given that we run tests against live environments less often than we used to, meaning we don't know how reliable they actually are until we go ahead and promote them?
- I don't think it's something we actively planned, but the size of the smoke + reliable suite today seems acceptable for deployments. Expanding this suite may increase the overall duration of deploys, in which case we may need to reconsider how we determine which set of tests to run against deployments. If pipeline duration on deploys is a concern, we should address it regardless, perhaps by using `:deploy` metadata to target tests for deployment pipelines while retaining the original meaning of the `:reliable` metadata for its intended purpose (see the sketch after this list).
- Manually tracking the state of tests and updating their metadata based on their reliability status takes quite a bit of manual intervention; we may want to invest in improving automated approaches to determining and marking the reliability of individual E2E tests.
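A minimal sketch of the `:deploy` metadata idea above, assuming plain RSpec metadata tags (the spec name, tag combination, and test body are illustrative, not an agreed implementation):

```ruby
# Hypothetical spec tagged for deployment pipelines. `:deploy` selects it for
# deployment runs; `:reliable` keeps its original meaning.
RSpec.describe 'User login', :reliable, :deploy do
  it 'signs in with valid credentials' do
    # ... test body ...
  end
end
```

The deployment pipeline would then run only the tagged subset, e.g. `bundle exec rspec --tag deploy`, leaving the existing `:reliable` selection logic untouched.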
Original Context from Quality Engineering team meeting week 2023-07-17
- John McDonnell Reliable Tests as a stepping stone - the Reliable Spec Report is reporting 273 specs that are candidates to promote. Any reason we aren't promoting these? It would help bump coverage by quite a bit, while possibly avoiding the pitfalls of the flaky E2Es that cause the most hassle.
- John McDonnell I do think that perhaps 14 days isn’t long enough to fully determine that a test is ‘reliable’, so perhaps a subset of the above that have been ‘reliable’ for 31 days might be a good starting point.
- Sanad Liaquat We need to be more disciplined about actioning on the reliable spec report. Would it be helpful to have automation to create an MR to move tests in and out of the reliable bucket?
- Harsha Muralidhar I think there needs to be a review about how we identify reliable tests. Environments like staging/pre-prod cause failures which are related solely to the environment, for example: gitlab-org/gitlab#419219 (closed) and nothing to do with the test itself. Failure rate only on master/QA on GDK should be considered to deem a test as reliable IMO.
- Sanad Liaquat Worth raising an issue to discuss this further. Sometimes you need to update a test and run reliable tests on live envs.
- Andrejs Cunskis Building on top of what Harsha mentioned, if we start running "reliable" tests in MRs and not allowing them to fail, we will increase the number of incidents: as that reliable bucket grows, it will start blocking deployments due to failures on staging that don't manifest in isolated environments like Omnibus and GDK.
- John McDonnell - might be pedantic, but if a test in the reliable bucket fails due to something other than an application issue, it's flaky, not reliable. Defining-a-reliable-test clearly points out that to be 'reliable' it must pass in all pipelines, so we'd need to revisit our understanding of, and reasoning for, why we even have reliable tests if we want to start making this distinction between tests that work well in main vs staging.
Scope
- Reliable test report and data improvements
- Automated promotion to reliable/blocking bucket
- Automated Quarantining
- Automated De-quarantining
- Update reliable promotion and quarantining process documentation
Proposal
Phase 1 - Improve data in existing reliable test report for accuracy
- Increase data for determining reliability of tests on Staging by running the full e2e test suite more often:
  - After a feature flag is turned on => gitlab-com/chatops!393 (merged)
  - After deployment to staging-canary => gitlab-org/release-tools!2564 (merged)
- Save test failure exception to InfluxDB for later use in the reliability report (a sketch follows this list) => gitlab-org/gitlab!128738 (merged)
- Display failure exceptions on the reliable report issue => gitlab-org/gitlab!130276 (merged)
- Update rules for determining reliability: ignore failures with certain failure exceptions/reasons => gitlab-org/gitlab!133598 (merged)
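As a rough sketch of what pushing a failure exception to InfluxDB could look like, using the `influxdb-client` gem (the org, bucket, measurement, field, and environment variable names here are illustrative, not the actual implementation):

```ruby
require 'influxdb-client'

# Hypothetical sketch: record a test failure exception in InfluxDB so the
# reliability report can later group failures by exception class.
client = InfluxDB2::Client.new(
  ENV['QA_INFLUXDB_URL'],
  ENV['QA_INFLUXDB_TOKEN'],
  org: 'gitlab-qa',          # illustrative org name
  bucket: 'e2e-test-stats',  # illustrative bucket name
  precision: InfluxDB2::WritePrecision::NANOSECOND
)

point = InfluxDB2::Point.new(name: 'test-stats')
  .add_tag('name', 'User login')
  .add_tag('status', 'failed')
  .add_field('failure_exception', 'Net::ReadTimeout')

client.create_write_api.write(data: point)
client.close!
```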
Phase 2 - Automate quarantining and promotion to the `blocking` bucket (OKR)
- Implement mechanism to add rspec meta to tests (see the sketch after this list) => gitlab-org/ruby/gems/gitlab_quality-test_tooling!117 (merged)
- Automate promotion to reliable AND quarantining of reliable tests that are now deemed unreliable:
  - Copy `test_failure` to the `test-metrics-*.json` file => gitlab-org/ruby/gems/gitlab_quality-test_tooling!102 (merged)
  - Send `failure_issue` to InfluxDB as part of test metrics => gitlab-org/quality/pipeline-common!377 (merged)
  - Export reliability data to a json file => gitlab-org/gitlab!139742 (merged)
  - Add/update rules in the reliable test report for auto quarantine and promotion to reliable/blocking => gitlab-org/gitlab!141620 (merged)
  - Use the automated mechanism to create MRs that add quarantine meta to tests => gitlab-org/quality/toolbox!143 (merged)
  - Use the automated mechanism to create MRs that add blocking meta to tests => gitlab-org/quality/toolbox!143 (merged)
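A minimal sketch of what a mechanism for adding rspec meta to a test could look like; the real implementation lives in `gitlab_quality-test_tooling` and is more robust than this (the function name and regex approach are illustrative only):

```ruby
# Hypothetical sketch: append a metadata tag (e.g. :blocking) to the
# `RSpec.describe` line of a spec file, as an automated MR would.
def add_meta_to_spec(spec_path, description, meta)
  source = File.read(spec_path)
  pattern = /(RSpec\.describe\s+['"]#{Regexp.escape(description)}['"])/
  File.write(spec_path, source.sub(pattern, "\\1, :#{meta}"))
end

# Example: add_meta_to_spec('qa/specs/features/login_spec.rb', 'User login', 'blocking')
```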
Some notes regarding phase 2:
- The automation focuses on the output of the reliable test report by adding the `:blocking` meta to tests, which applies in MRs only (and not the `:reliable` meta, which blocks deployments).
- Initially we will create MRs for promotion in batches of 10 tests per reliability report, sorted by the highest number of runs, and increase this number as we gain confidence in the effort.
- We only consider failures in the `e2e-package-and-test`, `e2e-test-on-gdk` and `nightly` pipelines when deciding to quarantine a reliable test.
- We only consider failures with a failure rate of 1 percent or greater when deciding to quarantine a reliable test (see the sketch below).
- This effort was mainly focused on E2E tests, since lower-level tests do not have the concept of a reliable/blocking bucket. However, the code has been written to be generic and can be used on any level of tests if required.
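A minimal sketch of the quarantine decision rule described in the notes above, assuming per-test run records keyed by pipeline and status (the function name and data shape are hypothetical, not the actual tooling):

```ruby
# Hypothetical sketch of the quarantine rule: a reliable test is flagged for
# quarantine only if it failed in the pipelines we trust for this signal and
# its failure rate is at least 1%.
QUARANTINE_PIPELINES = %w[e2e-package-and-test e2e-test-on-gdk nightly].freeze
FAILURE_RATE_THRESHOLD = 1.0 # percent

# `runs` is assumed to be an array of hashes like:
#   { pipeline: 'e2e-package-and-test', status: 'failed' }
def quarantine_candidate?(runs)
  relevant = runs.select { |run| QUARANTINE_PIPELINES.include?(run[:pipeline]) }
  return false if relevant.empty?

  failures = relevant.count { |run| run[:status] == 'failed' }
  failure_rate = failures.fdiv(relevant.size) * 100

  failure_rate >= FAILURE_RATE_THRESHOLD
end
```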
Phase 3 - Automate de-quarantining, update process and docs (OKR)
- Run quarantined tests regularly on master pipelines to collect metrics => gitlab-org/gitlab!143542 (merged), gitlab-org/gitlab!143823 (merged)
- Add a section to the reliable report that lists the (`:reliable`?) tests in quarantine that have proven to be reliable and can be un-quarantined.
- Implement an automated mechanism that uses the data from the reliable test report to automatically create an MR to un-quarantine a (`:reliable`?) test (illustrated below).
- Update and remove the process documentation for demotion from reliable. Once a test is moved to the reliable bucket, it can only be quarantined, not demoted. => MR1, MR2, MR3
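For context on what un-quarantining touches: quarantined E2E specs carry `quarantine:` metadata pointing at their failure issue. A rough illustration (the spec name and issue URL are placeholders):

```ruby
# A quarantined spec carries metadata linking it to its failure issue:
RSpec.describe 'Create merge request', quarantine: {
  issue: 'https://gitlab.com/gitlab-org/gitlab/-/issues/000000', # placeholder
  type: :flaky
} do
  # ...
end
```

An automated de-quarantine MR would delete the `quarantine:` hash, after which the test runs in regular pipelines again.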