Add section for debugging live env test failures with Duo guide (1a8d2170) · Commits · GitLab.com / Content Sites / handbook

content/handbook/engineering/testing/_index.md

+2 −1

Original line number	Diff line number	Diff line
		@@ -140,7 +140,8 @@ our established request process:
		- [Reporting of Top Flaky Test Files](flaky-tests/_index.md#reporting-of-top-flaky-test-files) - Weekly assignments for high-impact flaky tests
		- [Product Engineer guide to E2E test failure issues](guide-to-e2e-test-failure-issues.md)
		- [Unhealthy Tests (Developer Docs)](https://docs.gitlab.com/development/testing_guide/unhealthy_tests/) - Technical debugging reference for GitLab contributors
		- [🪄 Debug MR Test Failures with Duo](using-duo-to-debug-test-failures-in-mrs.md) - Use Duo to quickly diagnose and fix test failures in your MR
		- [🪄 Debug MR Test Failures with Duo](using-duo-to-debug-test-failures.md#-using-duo-to-debug-and-fix-test-failures-in-your-merge-request) - Use Duo to quickly diagnose and fix test failures in your MR
		- [🔥 Debug Live Environment Test Failures with Duo](using-duo-to-debug-test-failures.md#-using-duo-to-debug-live-environment-test-failures) - Use Duo to quickly diagnose test failures in staging, canary, and production pipelines

		#### 📹 GitLab End-to-End Testing Overview (Video)

content/handbook/engineering/testing/oncall-rotation.md

+2 −2

		@@ -90,8 +90,8 @@ For teams requesting upgrade support (within or outside office hours):

		NOTE! We don't have many team members in the APAC area, so there is sometimes a 4-hour window without coverage. During that window, we kindly ask you to use our troubleshooting guides:

		- [Using Duo to debug test failures in MRs](/handbook/engineering/testing/using-duo-to-debug-test-failures-in-mrs/)
		- [Guide to E2E test failure issues](/handbook/engineering/testing/guide-to-e2e-test-failure-issues/)
		- [Using Duo to debug test failures](../testing/using-duo-to-debug-test-failures.md)
		- [Guide to E2E test failure issues](../testing/guide-to-e2e-test-failure-issues.md)

		For the Pipeline DRI:

content/handbook/engineering/testing/using-duo-to-debug-test-failures-in-mrs.md→content/handbook/engineering/testing/using-duo-to-debug-test-failures.md

+166 −0

		---
		title: Debug Test Failures in Merge Requests with Duo
		description: Concise guide to using Duo to diagnose and suggest fixes for test failures in a merge request.
		title: Debug Test Failures and Live Issues with Duo
		description: Concise guide to using Duo to diagnose and fix test failures in MRs and live environment E2E test pipelines.
		---

		GitLab Duo can help you quickly diagnose and resolve test failures in two key scenarios:

		- [Debug MR Test Failures](#-using-duo-to-debug-and-fix-test-failures-in-your-merge-request) - Determine if failures are related to your changes and get suggested fixes
		- [Debug Live Environment Failures](#-using-duo-to-debug-live-environment-test-failures) - Diagnose issues in staging, canary, and production monitoring pipelines

		---

		## 🪄 Using Duo to Debug and Fix Test Failures in Your Merge Request
		@@ -47,7 +54,7 @@ When your merge request has a failing test, use Duo to quickly determine if it's
		- Always review suggestions carefully before applying them
		- Test the fix locally if possible before committing

		## ⚠️ When Duo Can't Help
		### ⚠️ When Duo Can't Help

		If Duo's analysis does not resolve your issue, follow these steps in order:

		@@ -57,13 +64,103 @@ If Duo's analysis does not resolve your issue, follow these steps in order:
		2. 💻 Try reproducing locally (~10 minutes):
		- Execute the test against your GDK to confirm if it's environment-specific or a genuine issue
		3. 🚧 Request quarantine if needed:
		- If the failure is blocking `master` and is unrelated to your changes, consider the [Test Quarantine Process](./quarantine-process.md)
		- If the failure is blocking `master` and is unrelated to your changes, consider the [Test Quarantine Process](quarantine-process.md)

		---

		## 🔥 Using Duo to Debug Live Environment Test Failures

		When automated E2E tests fail in staging, canary, or production pipelines, use Duo to quickly diagnose whether the cause is an environment issue, an application bug, or a test problem.

		> ✨ Note: GitLab Duo is available on the ops instance (ops.gitlab.net) and can be used directly in job logs there.
		>
		> ⚠️ Critical: Staging-Canary Impact
		> Smoke test failures (`qa-smoke` jobs) in the [staging-canary pipeline](https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/pipelines) block deployments to production. When debugging these failures, prioritize determining whether the issue is a genuine application problem or a test issue that can be safely quarantined.

		1. Navigate to the failing pipeline on ops.gitlab.net:
		- [Staging-Canary Pipeline](https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/pipelines)
		- [Staging Pipeline](https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines)
		- [Canary Pipeline](https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines)
		- [Production Pipeline](https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines)

		2. Open the failing job that displays the test failure

		3. Invoke GitLab Duo in the job log view (press `d` or click the Duo button in the top right corner)

		> 📝 Note: Duo will automatically truncate lengthy job logs by removing the middle section. For greater accuracy, you can copy and paste the specific stack trace and error message into your prompt.
		>
		> 🌐 For browser-based tests: Duo cannot download artifacts automatically. If needed, you can manually download the DOM from the `failure_screenshots` directory or relevant artifacts in the job and paste it into your prompt to help Duo debug browser-based failures.

		4. Clear previous context to avoid confusion with other investigations:

		```text
		/clear
		```

		5. Prompt Duo to analyze the failure:

		```text
		Analyze this test failure in [staging-canary/staging/canary/production]:

		Context: Tests are automatically retried within a job. If a test passed on that automatic retry, ignore it completely.

		1. Check automatic retry status first - Look for retry sections in the log. Do not mention any tests that passed on automatic retry - we only care about tests that failed both attempts within the job.
		2. For persistent failures (failed both the initial attempt AND the automatic retry):
		- What failed and why? (error type, correlation IDs, specific error messages like 404s)
		- Likely cause: environment issue, flaky test, or genuine application problem?
		- Search https://gitlab.com/gitlab-org/quality/engineering-productivity for similar issues
		3. Urgency:
		- ⚠️ Persistent `qa-smoke` failure in staging-canary = DEPLOYMENT BLOCKER
		- Other persistent failures = Assess user impact

		Recommended actions (in order):
		1. Retry the entire job first (even persistent-within-job failures often pass on full job retry)
		2. If still failing AND blocking deployment:
		- If clearly a flaky/environment issue (not a real application bug): Use fast-quarantine immediately to unblock deployment
		- Link: https://gitlab.com/gitlab-org/quality/engineering-productivity/fast-quarantine
		- If genuine application issue that should not be released to customers: create an incident - DO NOT quarantine
		3. If not blocking deployment but is causing too much failure noise: Follow standard quarantine process

		Do not mention:
		- Tests that passed on automatic retry
		- Test case reporting output (test_case iid, Labels updated, etc.)

		Note: Distinguish application issues from test problems - don't quarantine real bugs.

		Provide issue links at the end.
		```
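Because Duo truncates long job logs, it can help to pull out the relevant failure lines before pasting them into the prompt. The following is a minimal sketch of one way to do that, not an official tool. The marker patterns in `FAILURE_PATTERNS` and the function name `extract_failure_excerpts` are assumptions for illustration; adjust the patterns to whatever your job logs actually print.

```python
import re

# Hypothetical failure markers (assumptions, not an official list).
FAILURE_PATTERNS = (r"Failure/Error:", r"\bError\b", r"correlation[_ ]id")


def extract_failure_excerpts(log_text, patterns=FAILURE_PATTERNS, context=5):
    """Return log excerpts around matching lines, merging overlapping windows."""
    lines = log_text.splitlines()
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    windows = []  # list of (start, end) line ranges, end exclusive
    for i, line in enumerate(lines):
        if not regex.search(line):
            continue
        start = max(0, i - context)
        end = min(len(lines), i + context + 1)
        if windows and start <= windows[-1][1]:
            # Overlaps or touches the previous window: extend it.
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return ["\n".join(lines[s:e]) for s, e in windows]
```

To use it, save the raw job log to a file, pass its contents to `extract_failure_excerpts`, and paste the returned excerpts below the prompt. Review the excerpts before pasting so that no credentials or tokens end up in the chat.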

		✅ What to expect:

		- Helps distinguish genuine environment issues from flaky or broken tests
		- Suggests fixes for test issues
		- Identifies potential service or application issues for escalation
		- Helps assess impact and urgency

		### ⚠️ When Duo Can't Help

		If Duo's analysis does not resolve your issue:

		1. 💻 Try reproducing locally (~10 minutes):
		- Log in to the environment manually and try to reproduce the test case
		- Execute the test against the target environment using credentials from 1Password
		2. 🔍 Cross-check with related systems:
		- Check [#incident-management](https://gitlab.enterprise.slack.com/archives/CB7P5CJS1) Slack channel for recent incidents
		- Check [GitLab.com status page](https://status.gitlab.com) for known incidents
		- Review recent deployments that might correlate with the failure
		- Look for patterns across multiple environment pipelines
		3. 🚧 Quarantine if needed:
		- For urgent deployment-blocking smoke tests: Use [fast-quarantine](https://gitlab.com/gitlab-org/quality/engineering-productivity/fast-quarantine) to immediately unblock deployments
		- For non-urgent test issues: Follow the [Test Quarantine Process](quarantine-process.md)
		- Important: Only quarantine test issues, not genuine application bugs - escalate those instead
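The quarantine and escalation rules in this section can be summarized as a small decision function. This is a sketch under stated assumptions, not an official policy implementation: the names `Cause`, `Action`, `is_deployment_blocker`, and `recommend_action` are hypothetical, matching jobs by the `qa-smoke` name prefix is an assumption, and the fallback of "keep investigating" when the cause is unclear is not stated in the guide.

```python
from enum import Enum


class Cause(Enum):
    FLAKY_OR_ENVIRONMENT = "flaky test or environment issue"
    GENUINE_APPLICATION_ISSUE = "genuine application issue"
    UNKNOWN = "not yet determined"


class Action(Enum):
    RETRY_JOB = "retry the entire job"
    FAST_QUARANTINE = "use fast-quarantine to unblock deployment"
    CREATE_INCIDENT = "create an incident, do not quarantine"
    STANDARD_QUARANTINE = "follow the standard quarantine process"
    ESCALATE = "escalate, do not quarantine"
    INVESTIGATE = "keep investigating"  # assumption: fallback not in the guide


def is_deployment_blocker(job_name, pipeline):
    """Persistent qa-smoke failures in staging-canary block production deploys.

    Matching on the job-name prefix is an assumption for illustration.
    """
    return pipeline == "staging-canary" and job_name.startswith("qa-smoke")


def recommend_action(job_retried, blocking, cause, noisy=False):
    """Map a persistent failure to the next step described in this guide."""
    if not job_retried:
        return Action.RETRY_JOB
    if blocking:
        if cause is Cause.FLAKY_OR_ENVIRONMENT:
            return Action.FAST_QUARANTINE
        if cause is Cause.GENUINE_APPLICATION_ISSUE:
            return Action.CREATE_INCIDENT
        return Action.INVESTIGATE
    if cause is Cause.GENUINE_APPLICATION_ISSUE:
        return Action.ESCALATE
    if noisy and cause is Cause.FLAKY_OR_ENVIRONMENT:
        return Action.STANDARD_QUARANTINE
    return Action.INVESTIGATE
```

The ordering reflects the guide's priorities: a full job retry always comes first, and a genuine application issue is never quarantined regardless of how urgent the unblock is.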

		---

		## 📚 Related Resources

		- [Testing Guide](_index.md) - Complete testing overview
		- [GitLab Testing Guide](https://docs.gitlab.com/development/testing_guide) - Technical implementation details
		- [Detailed Quarantine Process](./quarantine-process.md) - How to quarantine tests
		- [Test Quarantine Process](quarantine-process.md) - How to quarantine tests
		- [Guide to E2E Test Failure Issues](guide-to-e2e-test-failure-issues.md) - Product engineer debugging guide

		Need help? Reach out in [#s_developer_experience](https://gitlab.enterprise.slack.com/archives/C07TWBRER7H) or create a [Request for Help issue](https://gitlab.com/gitlab-org/quality/test-governance/request-for-help/-/issues/new)