Audit of end-to-end test runs in various environments

Audit of E2E runs

The goal of this analysis is to be more efficient and reduce load/noise for SETs on call.

Useful links:

Environments: https://about.gitlab.com/handbook/engineering/infrastructure/environments/
GitLab.com deployments process: https://about.gitlab.com/handbook/engineering/deployments-and-releases/deployments/#gitlabcom-deployments-process
Differences between canary and main stages: https://about.gitlab.com/handbook/engineering/infrastructure/environments/canary-stage/#what-is-different-and-what-is-shared-between-canary-and-main-stages
E2E test pipelines: https://about.gitlab.com/handbook/engineering/quality/quality-engineering/debugging-qa-test-failures/#qa-test-pipelines
Diagram of deployments and e2e workflow: https://docs.google.com/presentation/d/1A0G1_HE19Y3X2K3fTnKl0wKx-AaVetibl3UlA6i8NjQ/edit#slide=id.g13a3474b417_0_3
E2E dashboards: https://about.gitlab.com/handbook/engineering/quality/quality-engineering/test-metrics-dashboards/
(Auto-)Deployer pipelines: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines

Diagram that summarizes the places where we run e2e tests

– Source: https://docs.google.com/presentation/d/1A0G1_HE19Y3X2K3fTnKl0wKx-AaVetibl3UlA6i8NjQ/edit#slide=id.g13a3474b417_0_3

`master` - https://gitlab.com/gitlab-org/gitlab

2-hourly scheduled pipeline:
- e2e:package-and-test: 2298 tests. https://gitlab.com/gitlab-org/gitlab/-/pipelines/773002387
- start-review-app-pipeline
  - review-qa-smoke : 22 tests
  - review-qa-blocking (aka reliable): 145 tests
  - review-qa-non-blocking: 0 tests
nightly scheduled pipeline:
- e2e:package-and-test: 2298 tests. https://gitlab.com/gitlab-org/gitlab/-/pipelines/771914467
- start-review-app-pipeline
  - review-qa-smoke : 22 tests
  - review-qa-blocking (aka reliable): 145 tests
  - review-qa-non-blocking: 0 tests

Nightly package - https://gitlab.com/gitlab-org/quality/nightly

e2e:package-and-test: 3984 tests (including quarantined tests). https://gitlab.com/gitlab-org/quality/nightly/-/pipelines/777869253

Staging-ref environment - https://ops.gitlab.net/gitlab-org/quality/staging-ref

Staging Ref is a sandbox environment used for pre-production testing of the latest Staging Canary code.

daily geo tests: 26 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1717756
Deployment QA pipeline: triggered on deployments.
- qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1720016
- qa-full: 426 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1718864

Staging-canary environment - https://ops.gitlab.net/gitlab-org/quality/staging-canary

Staging-Canary is an environment subset or deployment "stage" in the Staging environment, sharing most of the same infrastructure as Staging. This additional stage is designed to assist us with capturing issues arising due to mixed deployments, where we have multiple versions of one or more components of GitLab that share services such as the database. Information on how to access it, use it, and what services it covers is documented in our handbook page on canary stage environments.

daily full QA suite: 403 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1646910
Deployment QA pipeline: triggered on deployments.
- qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/pipelines/1718151

Canary - https://ops.gitlab.net/gitlab-org/quality/canary

Production-Canary is an environment subset or deployment "stage" in the Production environment, sharing most of the same infrastructure as Production. This additional stage is designed to assist us with rolling out new releases to end users in a more controlled fashion, hoping to catch issues affecting users in a way that minimises impact.

Deployment QA pipeline: triggered on deployments.
- qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines/1718350
- qa-full: 403 tests. https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines/1718349
- Mixed-deployment tests on gprd-cny and gprd (smoke-main) - Blocking

Staging environment - https://ops.gitlab.net/gitlab-org/quality/staging

4-hourly no admin smoke tests: qa-smoke and qa-reliable: 143 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1452025
Deployment QA pipeline: triggered on deployments.
- qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1718152
Post-deployment QA pipeline: triggered after post-deploy migrations.
- qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1713232
- qa-full: 452 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1713231
Inactive. ~~4-hourly smoke tests: qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1718255~~
Inactive. ~~daily full QA suite: 188 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1646910~~
Inactive. ~~daily geo tests: 26 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/870962~~

Preprod environment - https://ops.gitlab.net/gitlab-org/quality/preprod

The pre environment is used for validating the release candidates that go into final self-managed releases and production patches. It does not have a full production HA topology or a copy of the production database.

Monthly pre-release smoke tests. Runs daily from the 15th to 22nd inclusive: 128 tests. https://ops.gitlab.net/gitlab-org/quality/preprod/-/pipelines/1681301
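The "15th to 22nd inclusive" window can be expressed as a simple day-of-month check. This is a minimal sketch for illustration only: the function name and parameters are hypothetical and do not reflect how the schedule is actually configured in the preprod project.

```python
from datetime import date


def in_prerelease_window(day: date, first: int = 15, last: int = 22) -> bool:
    """Return True if `day` falls in the monthly pre-release smoke test window.

    Both bounds are inclusive, matching "from the 15th to 22nd inclusive".
    """
    return first <= day.day <= last


# Number of daily runs per month under this assumption (both bounds included).
RUNS_PER_MONTH = 22 - 15 + 1

if __name__ == "__main__":
    print(in_prerelease_window(date(2023, 6, 15)))
    print(in_prerelease_window(date(2023, 6, 23)))
    print(RUNS_PER_MONTH)
```

With both bounds included, the window covers 8 daily runs per month.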

Release environment - https://ops.gitlab.net/gitlab-org/quality/release

The release environment is an environment used for validating security releases, self-managed final monthly and patch versions. It does not have a full production HA topology or a copy of the production database.

Deployment QA pipeline: triggered on deployments.
- qa-smoke and qa-reliable: 129 tests. https://ops.gitlab.net/gitlab-org/quality/release/-/pipelines/1700540

Production environment - https://ops.gitlab.net/gitlab-org/quality/production

No scheduled pipelines
Deployment QA pipeline: triggered on deployments. qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718351
Feature flag QA pipeline: triggered when a feature flag is toggled via ChatOps. 403 tests. https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043

Conclusions & Proposals

Preliminary notes

Staging & staging-canary look very stable (failure notifications are very rare)
Canary seems to have more failures than production
Do we need to run quarantined tests at all? These jobs are allowed to fail and don't seem to add any value.

Nightly

Nightly runs ce: jobs in addition to ee: jobs. Is this required?
gitlab-org/gitlab and gitlab-org/quality/nightly don't seem to run the same jobs. For instance airgapped tests only run in the latter, while cloud-activation only runs in the former.
Proposal: Migrate gitlab-org/quality/nightly project to g... (#198 - closed)

`gitlab-org/quality/nightly`

Questions: What's the difference between e2e:package-and-test in https://gitlab.com/gitlab-org/quality/nightly and in https://gitlab.com/gitlab-org/gitlab? Can we stop using https://gitlab.com/gitlab-org/quality/nightly entirely?
Proposal: Migrate gitlab-org/quality/nightly project to g... (#198 - closed)

`gitlab-org/gitlab`

Proposal: Stop running e2e:package-and-test-ee on gitlab-org/gitlab nightly schedules: these already run every 2 hours. Implemented.
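The redundancy behind this proposal can be put in rough numbers. This is a back-of-the-envelope sketch, assuming the 2-hourly schedule fires exactly 12 times a day and that every run executes the 2298 tests listed for e2e:package-and-test above; real schedules may skip or overlap runs.

```python
TESTS_PER_RUN = 2298        # e2e:package-and-test count from the audit
RUNS_PER_DAY_2H = 24 // 2   # assumption: one run every 2 hours, no gaps

# Daily test executions from the 2-hourly schedule alone.
two_hourly_tests = TESTS_PER_RUN * RUNS_PER_DAY_2H

# Daily test executions when the nightly schedule also runs the same job.
with_nightly_tests = two_hourly_tests + TESTS_PER_RUN

# Fraction of the daily load contributed by the nightly run.
nightly_share = TESTS_PER_RUN / with_nightly_tests

if __name__ == "__main__":
    print(two_hourly_tests, with_nightly_tests, f"{nightly_share:.1%}")
```

Under these assumptions the nightly run is a 13th execution of the same job each day, so dropping it removes about 1/13 of the daily executions without reducing how often master is exercised.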

`staging-ref`

From #174 (comment 1285274596):

Proposal: Leave only Sanity suite running against Staging Ref or even just Smoke subset - to continue validating that env is healthy. Full suite can be triggered manually if needed via schedule => https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/merge_requests/539

`staging-canary`

Question: Do we need to run the daily full QA suite? We already run the full suite on master, staging-ref, canary and staging deployments.
Answered by Zeff at #174 (comment 1282036371):

Yes. staging-canary is our first opportunity to capture issues by testing in a more production-like environment. If we remove a full run, I would remove it from staging since the purpose of that environment now is to mimic what we already have in production and won't really help us catch something early in the process. The only other tests running here are smoke/reliable on deployments.

`canary`

Proposal: Stop running the full test suite. Won't do: gitlab-org/gitlab!122008 (merged).

`staging`

Proposal: Stop running the 4-hourly no admin smoke tests schedule against staging: could we run them against staging-ref instead? => gitlab-org/gitlab#415028 (closed)
Proposal: Stop running full test suite after deployment, run it only after post-deploy migrations => Done by https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/merge_requests/517/diffs#diff-content-bc5cab53d68206462edfe0a987db0a44662251bb
Proposal: When a feature flag is toggled via ChatOps. qa-smoke and qa-reliable: 135 tests (instead of 405 tests). From #174 (comment 1282628933). https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043 => gitlab-com/chatops!376 (merged)

`production`

Proposal: When a feature flag is toggled via ChatOps. qa-smoke and qa-reliable: 135 tests (instead of 405 tests). From #174 (comment 1282628933). https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043 => gitlab-com/chatops!376 (merged)

Edited Jun 19, 2023 by Rémy Coutable