Audit of end-to-end test runs in various environments
Audit of E2E runs
The goal of this analysis is to make E2E test runs more efficient and to reduce load and noise for on-call SETs.
Useful links:
- Environments: https://about.gitlab.com/handbook/engineering/infrastructure/environments/
- GitLab.com deployments process: https://about.gitlab.com/handbook/engineering/deployments-and-releases/deployments/#gitlabcom-deployments-process
- Differences between canary and main stages: https://about.gitlab.com/handbook/engineering/infrastructure/environments/canary-stage/#what-is-different-and-what-is-shared-between-canary-and-main-stages
- E2E test pipelines: https://about.gitlab.com/handbook/engineering/quality/quality-engineering/debugging-qa-test-failures/#qa-test-pipelines
- Diagram of deployments and e2e workflow: https://docs.google.com/presentation/d/1A0G1_HE19Y3X2K3fTnKl0wKx-AaVetibl3UlA6i8NjQ/edit#slide=id.g13a3474b417_0_3
- E2E dashboards: https://about.gitlab.com/handbook/engineering/quality/quality-engineering/test-metrics-dashboards/
- (Auto-)Deployer pipelines: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines
Diagram that summarizes the places where we run e2e tests
master - https://gitlab.com/gitlab-org/gitlab
- 2-hourly scheduled pipeline:
  - e2e:package-and-test: 2298 tests. https://gitlab.com/gitlab-org/gitlab/-/pipelines/773002387
  - start-review-app-pipeline
    - review-qa-smoke: 22 tests
    - review-qa-blocking (aka reliable): 145 tests
    - review-qa-non-blocking: 0 tests
- nightly scheduled pipeline:
  - e2e:package-and-test: 2298 tests. https://gitlab.com/gitlab-org/gitlab/-/pipelines/771914467
  - start-review-app-pipeline
    - review-qa-smoke: 22 tests
    - review-qa-blocking (aka reliable): 145 tests
    - review-qa-non-blocking: 0 tests
Nightly package - https://gitlab.com/gitlab-org/quality/nightly
- e2e:package-and-test: 3984 tests (including quarantined tests). https://gitlab.com/gitlab-org/quality/nightly/-/pipelines/777869253
Staging-ref environment - https://ops.gitlab.net/gitlab-org/quality/staging-ref
Staging Ref is a sandbox environment used for pre-production testing of the latest Staging Canary code.
- daily geo tests: 26 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1717756
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1720016
  - qa-full: 426 tests. https://ops.gitlab.net/gitlab-org/quality/staging-ref/-/pipelines/1718864
Staging-canary environment - https://ops.gitlab.net/gitlab-org/quality/staging-canary
Staging-Canary is an environment subset or deployment "stage" in the Staging environment, sharing most of the same infrastructure as Staging. This additional stage is designed to assist us with capturing issues arising due to mixed deployments, where we have multiple versions of one or more components of GitLab that share services such as the database. Information on how to access it, use it, and what services it covers is documented in our handbook page on canary stage environments.
- daily full QA suite: 403 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1646910
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/pipelines/1718151
Canary - https://ops.gitlab.net/gitlab-org/quality/canary
Production-Canary is an environment subset or deployment "stage" in the Production environment, sharing most of the same infrastructure as Production. This additional stage is designed to assist us with rolling out new releases to end users in a more controlled fashion, hoping to catch issues affecting users in a way that minimises impact.
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines/1718350
  - qa-full: 403 tests. https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines/1718349
  - Mixed-deployment tests on gprd-cny and gprd (smoke-main): blocking
Staging environment - https://ops.gitlab.net/gitlab-org/quality/staging
- 4-hourly no admin smoke tests:
  - qa-smoke and qa-reliable: 143 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1452025
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1718152
- Post-deployment QA pipeline: triggered after post-deploy migrations.
  - qa-smoke and qa-reliable: 150 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1713232
  - qa-full: 452 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1713231
- Inactive. 4-hourly smoke tests: qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1718255
- Inactive. daily full QA suite: 188 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/1646910
- Inactive. daily geo tests: 26 tests. https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines/870962
Preprod environment - https://ops.gitlab.net/gitlab-org/quality/preprod
The pre environment is an environment used for validating release candidates used to prepare final self-managed releases and production patches. It does not have a full production HA topology or a copy of the production database.
- Monthly pre-release smoke tests. Runs daily from the 15th to 22nd inclusive: 128 tests. https://ops.gitlab.net/gitlab-org/quality/preprod/-/pipelines/1681301
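The day-of-month window described for the pre-release smoke tests can be expressed as a small check. This is a minimal sketch under stated assumptions, not the project's actual schedule configuration: the function name `runs_pre_release_smoke` and the constants are hypothetical, and only the 15th-to-22nd window comes from the text.

```python
from datetime import date

# Window taken from the text: daily from the 15th to the 22nd, inclusive.
FIRST_DAY = 15
LAST_DAY = 22


def runs_pre_release_smoke(day: date) -> bool:
    """Return True if the pre-release smoke schedule would run on `day`.

    Hypothetical helper for illustration; the real schedule is defined
    in the preprod project's pipeline schedules.
    """
    return FIRST_DAY <= day.day <= LAST_DAY


if __name__ == "__main__":
    # Boundary days on either side of the window.
    for d in (date(2023, 3, 14), date(2023, 3, 15), date(2023, 3, 22), date(2023, 3, 23)):
        print(d.isoformat(), runs_pre_release_smoke(d))
```

Because both endpoints are inclusive, the window covers eight runs per month.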
Release environment - https://ops.gitlab.net/gitlab-org/quality/release
The release environment is an environment used for validating security releases, self-managed final monthly and patch versions. It does not have a full production HA topology or a copy of the production database.
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 129 tests. https://ops.gitlab.net/gitlab-org/quality/release/-/pipelines/1700540
Production environment - https://ops.gitlab.net/gitlab-org/quality/production
- No scheduled pipelines
- Deployment QA pipeline: triggered on deployments.
  - qa-smoke and qa-reliable: 135 tests. https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718351
- When a feature flag is toggled via ChatOps: 403 tests. https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043
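To get a rough sense of the load generated by deployment-triggered pipelines, the counts listed in the inventory can be tallied per environment. The sketch below is illustrative only: the dictionary layout and function names are hypothetical, the counts are copied from the "Deployment QA pipeline" entries above, and it assumes each pipeline runs once per rollout. That is a simplification, since the release environment in particular deploys on its own cadence. The totals are test executions, not unique tests, because the same tests run in several environments.

```python
# Test counts copied from the "Deployment QA pipeline" entries of the inventory.
# The structure and names here are hypothetical, for illustration only.
DEPLOYMENT_PIPELINES = {
    "staging-ref": {"qa-smoke + qa-reliable": 150, "qa-full": 426},
    "staging-canary": {"qa-smoke + qa-reliable": 150},
    "canary": {"qa-smoke + qa-reliable": 135, "qa-full": 403},
    "staging": {"qa-smoke + qa-reliable": 150},
    "release": {"qa-smoke + qa-reliable": 129},
    "production": {"qa-smoke + qa-reliable": 135},
}


def executions_per_environment(pipelines):
    """Sum the test executions of all deployment-triggered jobs per environment."""
    return {env: sum(jobs.values()) for env, jobs in pipelines.items()}


def total_executions(pipelines):
    """Total test executions if every listed pipeline runs exactly once."""
    return sum(executions_per_environment(pipelines).values())


if __name__ == "__main__":
    for env, count in executions_per_environment(DEPLOYMENT_PIPELINES).items():
        print(f"{env}: {count}")
    print("total:", total_executions(DEPLOYMENT_PIPELINES))
```

Under these assumptions, the two environments that run qa-full on deployment (staging-ref and canary) account for most of the executions, which is why the proposals below focus on where the full suite runs.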
Conclusions & Proposals
Preliminary notes
- Staging & staging-canary look very stable (failure notifications are very rare)
- Canary seems to have more failures than production
- Do we need to run quarantined tests at all? These jobs are allowed to fail and don't seem to add any value.
Nightly
- Nightly runs ce: jobs in addition to ee: jobs. Is it required?
- gitlab-org/gitlab and gitlab-org/quality/nightly don't seem to run the same jobs. For instance, airgapped tests only run in the latter, while cloud-activation only runs in the former.
- Proposal: Migrate gitlab-org/quality/nightly project to g... (#198 - closed)
gitlab-org/quality/nightly
- Questions: What's the difference between e2e:package-and-test in https://gitlab.com/gitlab-org/quality/nightly and in https://gitlab.com/gitlab-org/gitlab? Can we stop using https://gitlab.com/gitlab-org/quality/nightly entirely?
- Proposal: Migrate gitlab-org/quality/nightly project to g... (#198 - closed)
gitlab-org/gitlab
- Proposal: Stop running e2e:package-and-test-ee on gitlab-org/gitlab nightly schedules: these already run every 2 hours. Implemented.
staging-ref
From #174 (comment 1285274596):
- Proposal: Leave only the Sanity suite, or even just the Smoke subset, running against Staging Ref to continue validating that the environment is healthy. The full suite can be triggered manually via schedule if needed => https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/merge_requests/539
staging-canary
- Question: Do we need to run the daily full QA suite? We already run the full suite on master, staging-ref, canary and staging deployments.
- Answered by Zeff at #174 (comment 1282036371):
  Yes. staging-canary is our first opportunity to capture issues by testing in a more production-like environment. If we remove a full run, I would remove it from staging since the purpose of that environment now is to mimic what we already have in production and won't really help us catch something early in the process. The only other tests running here are smoke/reliable on deployments.
canary
- Proposal: Stop running the full test suite. Won't do: gitlab-org/gitlab!122008 (merged).
staging
- Proposal: Stop running the 4-hourly no admin smoke tests schedule against staging: could we run them against staging-ref instead? => gitlab-org/gitlab#415028 (closed)
- Proposal: Stop running the full test suite after deployment and run it only after post-deploy migrations => Done by https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/merge_requests/517/diffs#diff-content-bc5cab53d68206462edfe0a987db0a44662251bb
- Proposal: When a feature flag is toggled via ChatOps, run qa-smoke and qa-reliable: 135 tests (instead of 405 tests). From #174 (comment 1282628933). https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043 => gitlab-com/chatops!376 (merged)
production
- Proposal: When a feature flag is toggled via ChatOps, run qa-smoke and qa-reliable: 135 tests (instead of 405 tests). From #174 (comment 1282628933). https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines/1718043 => gitlab-com/chatops!376 (merged)
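Taken together, the staging and feature-flag proposals amount to choosing a suite based on what triggered the pipeline. The following is a minimal sketch of that decision, not the actual deployer or ChatOps implementation: the `Trigger` enum and `suites_for` function are hypothetical names, and only the suite names and the mapping itself come from the proposals above.

```python
from enum import Enum


class Trigger(Enum):
    """Hypothetical labels for the events that start a QA pipeline."""
    DEPLOYMENT = "deployment"
    POST_DEPLOY_MIGRATIONS = "post_deploy_migrations"
    FEATURE_FLAG_TOGGLE = "feature_flag_toggle"


SMOKE_RELIABLE = ("qa-smoke", "qa-reliable")
FULL = ("qa-full",)


def suites_for(trigger: Trigger) -> tuple:
    """Return the suites to run for a trigger, following the proposals above.

    - Deployment: smoke and reliable only.
    - Post-deploy migrations: smoke and reliable, plus the full suite.
    - Feature flag toggled via ChatOps: smoke and reliable instead of full.
    """
    if trigger is Trigger.POST_DEPLOY_MIGRATIONS:
        return SMOKE_RELIABLE + FULL
    return SMOKE_RELIABLE


if __name__ == "__main__":
    for trigger in Trigger:
        print(trigger.value, "->", ", ".join(suites_for(trigger)))
```

Keeping the full suite on a single trigger means a failure in it maps to one well-defined point in the rollout, which is the noise reduction this audit is after.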
