Audit of end-to-end test runs in various environments
Audit of E2E runs
The goal of this analysis is to make E2E test runs more efficient and to reduce load and noise for SETs on call.
Useful links:
- Environments: https://about.gitlab.com/handbook/engineering/infrastructure/environments/
- GitLab.com deployments process: https://about.gitlab.com/handbook/engineering/deployments-and-releases/deployments/#gitlabcom-deployments-process
- Differences between canary and main stages: https://about.gitlab.com/handbook/engineering/infrastructure/environments/canary-stage/#what-is-different-and-what-is-shared-between-canary-and-main-stages
- E2E test pipelines: https://about.gitlab.com/handbook/engineering/quality/quality-engineering/debugging-qa-test-failures/#qa-test-pipelines
- Diagram of deployments and e2e workflow: https://docs.google.com/presentation/d/1A0G1_HE19Y3X2K3fTnKl0wKx-AaVetibl3UlA6i8NjQ/edit#slide=id.g13a3474b417_0_3
- E2E dashboards: https://about.gitlab.com/handbook/engineering/quality/quality-engineering/test-metrics-dashboards/
- (Auto-)Deployer pipelines: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines
Diagram that summarizes the places where we run e2e tests
master
- https://gitlab.com/gitlab-org/gitlab
- 2-hourly scheduled pipeline:
  - e2e:package-and-test: 2298 tests. https://gitlab.com/gitlab-org/gitlab/-/pipelines/773002387
  - start-review-app-pipeline
    - review-qa-smoke: 22 tests
    - review-qa-blocking (aka reliable): 145 tests
    - review-qa-non-blocking: 0 tests
- nightly scheduled pipeline:
  - e2e:package-and-test: 2298 tests. https://gitlab.com/gitlab-org/gitlab/-/pipelines/771914467
  - start-review-app-pipeline
    - review-qa-smoke: 22 tests
    - review-qa-blocking (aka reliable): 145 tests
    - review-qa-non-blocking: 0 tests
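To get a rough sense of the volume these schedules generate, the counts above can be multiplied out per day. This is a minimal sketch, assuming every scheduled run executes every listed job in full; the function names are hypothetical and the result is an upper-bound estimate, not a measured figure.

```python
# Test counts copied from the master schedule listing above.
TESTS_PER_RUN = {
    "e2e:package-and-test": 2298,
    "review-qa-smoke": 22,
    "review-qa-blocking": 145,
    "review-qa-non-blocking": 0,
}


def runs_per_day(interval_hours: int) -> int:
    """Number of scheduled runs in 24 hours for a fixed interval."""
    if interval_hours <= 0 or 24 % interval_hours != 0:
        raise ValueError("interval must be positive and divide 24 evenly")
    return 24 // interval_hours


def daily_executions(interval_hours: int, tests_per_run: dict) -> int:
    """Upper bound on test executions per day (assumes no skipped jobs)."""
    return runs_per_day(interval_hours) * sum(tests_per_run.values())


two_hourly = daily_executions(2, TESTS_PER_RUN)   # 12 runs per day
nightly = daily_executions(24, TESTS_PER_RUN)     # 1 run per day
print(f"2-hourly: {two_hourly} executions/day, nightly: {nightly} executions/day")
```

Under these assumptions the 2-hourly schedule accounts for twelve times the nightly volume, which is the overlap the nightly proposal later in this page targets.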
https://gitlab.com/gitlab-org/quality/nightly
Nightly package
- e2e:package-and-test: 3984 tests (including quarantined tests). https://gitlab.com/gitlab-org/quality/nightly/-/pipelines/777869253
https://ops.gitlab.net/gitlab-org/quality/staging-ref
Staging-ref environment
Staging Ref is a sandbox environment used for pre-production testing of the latest Staging Canary code.
- daily geo tests: 26 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1717756
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1720016
  - qa-full: 426 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1718864
https://ops.gitlab.net/gitlab-org/quality/staging-canary
Staging-canary environment
Staging-Canary is an environment subset or deployment "stage" in the Staging environment, sharing most of the same infrastructure as Staging. This additional stage is designed to assist us with capturing issues arising due to mixed deployments, where we have multiple versions of one or more components of GitLab that share services such as the database. Information on how to access it, use it, and what services it covers is documented in our handbook page on canary stage environments.
- daily full QA suite: 403 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1646910
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/pipelines/1718151
https://ops.gitlab.net/gitlab-org/quality/canary
Canary
Production-Canary is an environment subset or deployment "stage" in the Production environment, sharing most of the same infrastructure as Production. This additional stage is designed to assist us with rolling out new releases to end users in a more controlled fashion, hoping to catch issues affecting users in a way that minimises impact.
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines/1718350
  - qa-full: 403 tests. https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines/1718349
  - Mixed-deployment tests on gprd-cny and gprd (smoke-main) - Blocking
https://ops.gitlab.net/gitlab-org/quality/staging
Staging environment
- 4-hourly no admin smoke tests: qa-smoke and qa-reliable: 143 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1452025
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1718152
- Post-deployment QA pipeline: triggered after post-deploy migrations.
  - qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1713232
  - qa-full: 452 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1713231
- Inactive. 4-hourly smoke tests: qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1718255
- Inactive. daily full QA suite: 188 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1646910
- Inactive. daily geo tests: 26 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/870962
https://ops.gitlab.net/gitlab-org/quality/preprod
Preprod environment
The pre environment is an environment used for validating release candidates used to prepare final self-managed releases and production patches. It does not have a full production HA topology or a copy of the production database.
- Monthly pre-release smoke tests. Runs daily from the 15th to 22nd inclusive: 128 tests. https://ops.gitlab.net/gitlab-org/quality/preprod/-/pipelines/1681301
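The "15th to 22nd inclusive" window can be expressed as a simple date check. This is only an illustrative sketch: the function and constant names are hypothetical, and the real schedule is configured in the pipeline scheduler rather than in code like this.

```python
from datetime import date

# Window boundaries taken from the schedule description above (inclusive).
WINDOW_START_DAY = 15
WINDOW_END_DAY = 22


def in_prerelease_window(day: date) -> bool:
    """Return True if the pre-release smoke tests would run on this day."""
    return WINDOW_START_DAY <= day.day <= WINDOW_END_DAY


# Number of daily runs per month implied by an inclusive window.
runs_per_month = WINDOW_END_DAY - WINDOW_START_DAY + 1
```

Because both ends are inclusive, the window covers eight daily runs per month.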
https://ops.gitlab.net/gitlab-org/quality/release
Release environment
The release environment is an environment used for validating security releases, self-managed final monthly and patch versions. It does not have a full production HA topology or a copy of the production database.
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 129 tests. https://ops.gitlab.net/gitlab-org/quality/release/-/pipelines/1700540
https://ops.gitlab.net/gitlab-org/quality/production
Production environment
- No scheduled pipelines
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718351
- When a feature flag is toggled via ChatOps: 403 tests. https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043
Conclusions & Proposals
Preliminary notes
- Staging & staging-canary look very stable (failure notifications are very rare)
- Canary seems to have more failures than production
- Do we need to run quarantined tests at all? These jobs are allowed to fail and don't seem to add any value.
Nightly
- Nightly runs ce: jobs in addition to ee: jobs. Is it required?
- gitlab-org/gitlab and gitlab-org/quality/nightly don't seem to run the same jobs. For instance, airgapped tests only run in the latter, while cloud-activation only runs in the former.
- Proposal: Migrate gitlab-org/quality/nightly project to g... (#198 - closed) gitlab-org/quality/nightly
- Questions: What's the difference between e2e:package-and-test in https://gitlab.com/gitlab-org/quality/nightly and in https://gitlab.com/gitlab-org/gitlab? Can we stop using https://gitlab.com/gitlab-org/quality/nightly entirely?
- Proposal: Migrate gitlab-org/quality/nightly project to g... (#198 - closed) gitlab-org/gitlab
- Proposal: Stop running e2e:package-and-test-ee on gitlab-org/gitlab nightly schedules: these already run every 2 hours. Implemented.
staging-ref
From #174 (comment 1285274596):
- Proposal: Leave only the Sanity suite running against Staging Ref, or even just the Smoke subset, to continue validating that the environment is healthy. The full suite can be triggered manually via the schedule if needed => https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/merge_requests/539
staging-canary
- Question: Do we need to run daily full QA suite? We already run the full suite on master, staging-ref, canary and staging deployments.
- Answered by Zeff at #174 (comment 1282036371): Yes. staging-canary is our first opportunity to capture issues by testing in a more production-like environment. If we remove a full run, I would remove it from staging since the purpose of that environment now is to mimic what we already have in production and won't really help us catch something early in the process. The only other tests running here are smoke/reliable on deployments.
canary
- Proposal: Stop running the full test suite. Won't do: gitlab-org/gitlab!122008 (merged).
staging
- Proposal: Stop running the 4-hourly no admin smoke tests schedule against staging: could we run them against staging-ref instead? => gitlab-org/gitlab#415028
- Proposal: Stop running the full test suite after deployment; run it only after post-deploy migrations => Done by https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/merge_requests/517/diffs#diff-content-bc5cab53d68206462edfe0a987db0a44662251bb
- Proposal: When a feature flag is toggled via ChatOps, run only qa-smoke and qa-reliable: 135 tests (instead of 405 tests). From #174 (comment 1282628933). https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043 => gitlab-com/chatops!376 (merged)
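The staging proposal about the full suite amounts to choosing suites by trigger type. The sketch below illustrates that rule only; `Trigger` and `suites_for` are hypothetical names, and the actual selection lives in the deployer pipeline configuration, not in code like this.

```python
from enum import Enum


class Trigger(Enum):
    """Hypothetical trigger types for staging QA pipelines."""
    DEPLOYMENT = "deployment"
    POST_DEPLOY_MIGRATIONS = "post_deploy_migrations"


def suites_for(trigger: Trigger) -> list:
    """Return the suites to run for a trigger under the proposal above."""
    # Smoke and reliable suites run on every trigger.
    suites = ["qa-smoke", "qa-reliable"]
    # The full suite runs only after post-deploy migrations.
    if trigger is Trigger.POST_DEPLOY_MIGRATIONS:
        suites.append("qa-full")
    return suites


for trigger in Trigger:
    print(trigger.value, "->", suites_for(trigger))
```

Keeping the smoke and reliable suites on both triggers preserves fast feedback on deployments while the longer full run happens once per deployment cycle.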
production
- Proposal: When a feature flag is toggled via ChatOps, run only qa-smoke and qa-reliable: 135 tests (instead of 405 tests). From #174 (comment 1282628933). https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043 => gitlab-com/chatops!376 (merged)
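The saving from the feature-flag proposal can be checked with simple arithmetic. This sketch uses the figures quoted in the proposal (405 and 135); note that the production listing earlier in this page shows 403 tests for the same pipeline, so the exact numbers should be treated as approximate.

```python
# Figures quoted in the proposal above.
FULL_SUITE_TESTS = 405
SMOKE_RELIABLE_TESTS = 135

# Tests no longer executed on each feature flag toggle.
tests_saved = FULL_SUITE_TESTS - SMOKE_RELIABLE_TESTS

# Relative reduction in tests per toggle, as a percentage.
reduction_pct = 100 * tests_saved / FULL_SUITE_TESTS

print(f"{tests_saved} fewer tests per toggle ({reduction_pct:.1f}% reduction)")
```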