
[Spike] Evaluate what, when, and where to run tests

What, when, and where should we test?

Context

In light of the frequent pipeline failures we've seen over the past several years, and the extensive cloud resources they consume, I've put together this issue to capture some thoughts on the topic along with at least one idea for making progress toward addressing it. I'm very much open to alternatives here and would welcome any feedback that helps us reduce this burden on our time while also improving the quality of our product and the speed at which we can develop it.

Overview

One of the recurring problems we face on the team is pipeline failures.

In this issue, let's reconsider our testing setup by looking at its current state and see if there are any improvements that can be made.

The overarching question is: what, where, and when should we test?

Symptoms

Here's a high-level summary of the problems we currently face:

  • Pipelines frequently fail. A strong majority of the time, these failures are unrelated to recent changes and are instead due to flaky tests, transient networking errors, etc.
    • Examples: Charts pipeline failure issues
    • Similarly, we often see failures on the master branch because its review environments are long-lived. This means pipelines can interact with the environments out of order (for example, when certain pipelines fail), which leaves the environments in an unstable and often broken state. This is further exacerbated by the backup and restore tests.
  • Pipeline failures regularly require several team members to debug, retry, and log the results. This takes time away from important work.
  • Pipelines are long (often around 1 hour), which impacts MR development cycles as well as time spent triaging failed pipelines.

For examples of the resulting issues, see Charts maintenance::pipelines issues.

Interpretation

One interpretation of the symptoms above is that testing is a significant factor in the problems we face today.

To inventory the tests we currently run, using Charts CI as an example (since its pipelines seem to fail most often), we have:

  • Tests under spec/
    • Some run against Kubernetes clusters with live GitLab installations, including testing the backup and restore process
    • Some run without clusters, instead testing the Helm Charts template outputs
  • QA tests that exercise live GitLab installations, verifying functionality by driving the instance in a headless manner (a rough sketch of how these categories could map to CI jobs follows this list)
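For orientation, here is a rough sketch of how these categories tend to map onto CI jobs. The job names, commands, and the `requires_cluster` tag are illustrative assumptions, not copied from the actual Charts configuration:

```yaml
# Illustrative only: not the real Charts CI configuration.
stages:
  - test
  - qa

template_specs:          # spec/ tests that only render chart templates; no cluster needed
  stage: test
  script:
    - bundle exec rspec --tag ~requires_cluster

cluster_specs:           # spec/ tests against a live installation, including backup and restore
  stage: test
  script:
    - bundle exec rspec --tag requires_cluster

qa_suite:                # headless QA tests driven against the live installation
  stage: qa
  script:
    - ./scripts/run-qa.sh   # hypothetical wrapper around the QA suite
```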

This applies to both the Charts project and the Operator project, which inherits most of these tests.

This raises a question: where and when should these tests happen?

Currently, most of these tests happen on every pipeline: MRs, master, stable, and tags. Do they truly need to happen that frequently, given the types of changes we're testing?

I think it could make more sense to run these tests more strategically, ensuring that they still catch regressions without imposing undue time and maintenance burden.
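For reference, "happens on every pipeline" typically corresponds to a rules block like the one below. This is a simplified sketch with a hypothetical job name and tag, not our actual configuration; the point is that these rules are also the main lever we have for running a job more selectively:

```yaml
# Simplified sketch: a rules block like this fires for MRs, master,
# stable branches, and tags alike, so the expensive suite runs on
# effectively every pipeline.
cluster_specs:
  script:
    - bundle exec rspec --tag requires_cluster   # hypothetical tag
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
    - if: '$CI_COMMIT_BRANCH =~ /^[0-9]+-[0-9]+-stable$/'
    - if: '$CI_COMMIT_TAG'
```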

Potential solutions

Below, let's outline some potential solutions to these problems.

Idea 1: separate tests from Charts project CI into a separate QA project

Let's consider moving some of these tests into a separate, downstream project to be owned by QA that can run consistent sets of tests against each of our installation methods.

Action items

  • Create a new subgroup under QA ownership focused on comprehensive testing of Distribution installation methods.
  • In this subgroup, create a project for each installation method (Omnibus, Charts, Operator, etc.).
  • In each new downstream QA project, create CI that runs interactive spec tests (such as backup and restore) as well as QA tests (a rough sketch follows this list).
  • In each upstream project, remove these tests from their CI, but maintain the vcluster smoke tests and the tests that don't interact with a live cluster.
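To make this concrete, here is a minimal sketch of what one downstream QA project's CI could look like. The project, job names, scripts, variables, and the `requires_cluster` tag are all hypothetical, and a real pipeline would also need environment provisioning and credentials:

```yaml
# Hypothetical .gitlab-ci.yml for a downstream QA project covering one
# installation method (Charts in this example). Names are illustrative.
stages:
  - deploy
  - test

deploy_environment:
  stage: deploy
  script:
    - ./scripts/deploy-chart.sh "$CHART_VERSION"   # hypothetical deploy helper

interactive_specs:
  stage: test
  script:
    - bundle exec rspec --tag requires_cluster     # e.g. backup and restore specs

qa_full_suite:
  stage: test
  script:
    - ./scripts/run-qa.sh                          # hypothetical wrapper around the QA test suite
```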

Impact

The testing flow would end up looking something like this:

  1. gitlab-org/gitlab runs tests that confirm changes to this codebase are functional (related: #1368 (comment 1598247410)).
  2. Distribution's installation method projects (Omnibus, Charts, Operator, etc.) run tests that confirm changes to the installation methods are functional (for example, Charts would run templating tests and a simple smoke test against a cluster), then hand off to QA (see the sketch after this list).
  3. Finally, QA tests each of the installation methods fully. This includes advanced specs like backup and restore, as well as the full suite of QA tests. These should find any problems in the entire suite, which includes the application itself along with the relevant installation method.
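The hand-off between steps 2 and 3 could be a straightforward multi-project trigger from the upstream pipeline. A minimal sketch, assuming a hypothetical downstream project path and variable name (neither exists today):

```yaml
# Sketch of a trigger job in the upstream Charts .gitlab-ci.yml.
# The downstream project path and CHART_VERSION variable are hypothetical.
qa:
  trigger:
    project: gitlab-org/distribution/qa-charts   # hypothetical downstream QA project
    strategy: depend                             # upstream pipeline reflects the downstream result
  variables:
    CHART_VERSION: $CI_COMMIT_SHORT_SHA          # tells QA which build to test
```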

As a result, we see the following impacts:

  • Distribution pipelines become shorter and more focused. When Distribution pipelines fail, Distribution is most likely the correct team to fix it.
  • QA pipelines can be run in a similar fashion against each distribution method for better consistency. When QA pipelines fail, the QA team is most likely the correct team to fix it.
  • Cloud resource utilization is reduced by isolating these tests to only the scenarios where it makes sense to run them (related: gitlab-org/charts/gitlab#5013 (closed)).
  • The surface area for transient failures to occur is minimized.
  • Issues such as "Proposal to have process to identify and include testing tech debts in our various Distribution projects" can be addressed in a single-purpose repository without needing to make changes across other, unrelated code.

Idea 2: keep project structure as is, and reconsider only when to run tests

In this proposal, we leave the project structure as it is in order to preserve easier traceability between failures and their associated changesets. If we were to move tests out to a separate project, it would understandably become harder to trace which change (or set of changes) caused a failure.

Instead, we can look into reevaluating when we run the tests. Currently, they happen on effectively all branches. However, since MR pipelines use the merged result with master, running the tests again on master can be seen as redundant.

This could also apply to stable branches, which typically only contain commits that have already landed on master (and have therefore already been tested), but we'll need to verify whether this is always the case.

It's the failures on master that cause the most 'pain' for the team, because these failures create issues that our DRI (and often other team members) need to address manually. This takes considerable time away from other deliverables, and it's often unproductive work given that the failures are almost always transient/flaky: the changeset has already been proven to succeed in the MR pipeline that introduced it.
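Concretely, if we trust merged results pipelines, a change along these lines would limit the expensive suite to MR and tag pipelines. This is a sketch with a hypothetical job name and tag, and we'd first need to confirm that stable branches are safe to exclude:

```yaml
# Sketch only: drops the default-branch and stable-branch runs because the
# merged-results MR pipeline has already tested the same merged state.
cluster_specs:
  script:
    - bundle exec rspec --tag requires_cluster   # hypothetical tag
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    - if: '$CI_COMMIT_TAG'
    # intentionally no rule for $CI_DEFAULT_BRANCH or stable branches
```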

Impact

  • Project structure stays the same.
  • Issues are labeled with ~failure::*, which allows us to more easily identify common root causes of pipeline failures.
  • The changeset introducing a failure remains easily traceable.
  • We run tests only when necessary, freeing up cloud resources, reducing pipeline runtime, and minimizing the number of new issues for DRIs to triage.


Acceptance criteria

  • One or more of the above ideas are selected.
  • Issue(s)/Epic(s) created to implement idea(s).