Verified Commit 1a8d2170 authored by Jay McCure

Add section for debugging live env test failures with Duo guide

parent cbf65bda
+2 −1
@@ -140,7 +140,8 @@ our established request process:
  - [Reporting of Top Flaky Test Files](flaky-tests/_index.md#reporting-of-top-flaky-test-files) - Weekly assignments for high-impact flaky tests
- [Product Engineer guide to E2E test failure issues](guide-to-e2e-test-failure-issues.md)
- [Unhealthy Tests (Developer Docs)](https://docs.gitlab.com/development/testing_guide/unhealthy_tests/) - Technical debugging reference for GitLab contributors
- [🪄 Debug MR Test Failures with Duo](using-duo-to-debug-test-failures-in-mrs.md) - Use Duo to quickly diagnose and fix test failures in your MR
- [🪄 Debug MR Test Failures with Duo](using-duo-to-debug-test-failures.md#-using-duo-to-debug-and-fix-test-failures-in-your-merge-request) - Use Duo to quickly diagnose and fix test failures in your MR
- [🔥 Debug Live Environment Test Failures with Duo](using-duo-to-debug-test-failures.md#-using-duo-to-debug-live-environment-test-failures) - Use Duo to quickly diagnose test failures in staging, canary, and production pipelines

#### 📹 GitLab End-to-End Testing Overview (Video)

+2 −2
@@ -90,8 +90,8 @@ For teams requesting upgrade support (within or outside office hours):

NOTE: We have few team members in the APAC region, so there will sometimes be a 4-hour window without coverage. During that window, we kindly ask you to use our troubleshooting guides:

- [Using Duo to debug test failures in MRs](/handbook/engineering/testing/using-duo-to-debug-test-failures-in-mrs/)
- [Guide to E2E test failure issues](/handbook/engineering/testing/guide-to-e2e-test-failure-issues/)
- [Using Duo to debug test failures](../testing/using-duo-to-debug-test-failures.md)
- [Guide to E2E test failure issues](../testing/guide-to-e2e-test-failure-issues.md)

For the Pipeline DRI:

+166 −0
---
title: Debug Test Failures in Merge Requests with Duo
description: Concise guide to using Duo to diagnose and suggest fixes for test failures in a merge request.
title: Debug Test Failures and Live Issues with Duo
description: Concise guide to using Duo to diagnose and fix test failures in MRs and live environment E2E test pipelines.
---

GitLab Duo can help you quickly diagnose and resolve test failures in two key scenarios:

- **[Debug MR Test Failures](#-using-duo-to-debug-and-fix-test-failures-in-your-merge-request)** - Determine if failures are related to your changes and get suggested fixes
- **[Debug Live Environment Failures](#-using-duo-to-debug-live-environment-test-failures)** - Diagnose issues in staging, canary, and production monitoring pipelines

---

## 🪄 Using Duo to Debug and Fix Test Failures in Your Merge Request
@@ -47,7 +54,7 @@ When your merge request has a failing test, use Duo to quickly determine if it's
- Always review suggestions carefully before applying them
- Test the fix locally if possible before committing

## ⚠️ When Duo Can't Help
### ⚠️ When Duo Can't Help

If Duo's analysis does not resolve your issue, follow these steps in order:

@@ -57,13 +64,103 @@ If Duo's analysis does not resolve your issue, follow these steps in order:
   2. **💻 Try reproducing locally** (~10 minutes):
      - Execute the test against your GDK to confirm if it's environment-specific or a genuine issue
   3. **🚧 Request quarantine if needed**:
      - If the failure is blocking `master` and is unrelated to your changes, consider the [Test Quarantine Process](./quarantine-process.md)
      - If the failure is blocking `master` and is unrelated to your changes, consider the [Test Quarantine Process](quarantine-process.md)

---

## 🔥 Using Duo to Debug Live Environment Test Failures

When automated E2E tests fail in staging, canary, or production pipelines, use Duo to quickly diagnose whether the cause is an environment issue, an application bug, or a test problem.

   > **✨ Note:** GitLab Duo is available on the ops instance (ops.gitlab.net) and can be used directly in job logs there.
   >
   > **⚠️ Critical: Staging-Canary Impact**
   > Smoke test failures (`qa-smoke` jobs) in the [staging-canary pipeline](https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/pipelines) **block deployments to production**. When debugging these failures, prioritize determining whether the issue is a genuine application problem or a test issue that can be safely quarantined.

1. **Navigate to the failing pipeline** on ops.gitlab.net:
    - [Staging-Canary Pipeline](https://ops.gitlab.net/gitlab-org/quality/staging-canary/-/pipelines)
    - [Staging Pipeline](https://ops.gitlab.net/gitlab-org/quality/staging/-/pipelines)
    - [Canary Pipeline](https://ops.gitlab.net/gitlab-org/quality/canary/-/pipelines)
    - [Production Pipeline](https://ops.gitlab.net/gitlab-org/quality/production/-/pipelines)

2. **Open the failing job** that displays the test failure

3. **Invoke GitLab Duo** in the job log view (press `d` or click the Duo button in the top right corner)

   > **📝 Note:** Duo will automatically truncate lengthy job logs by removing the middle section. For greater accuracy, you can copy and paste the specific stack trace and error message into your prompt.
   >
   > **🌐 For browser-based tests:** Duo cannot download artifacts automatically. If needed, you can manually download the DOM from the `failure_screenshots` directory or relevant artifacts in the job and paste it into your prompt to help Duo debug browser-based failures.

4. **Clear previous context** to avoid confusion with other investigations:

   ```text
      /clear
   ```

5. **Prompt Duo to analyze the failure:**

   ```text
      Analyze this test failure in [staging-canary/staging/canary/production]:

      **Context:** Tests are automatically retried within a job. If a test passed on that automatic retry, ignore it completely.

      1. **Check automatic retry status first** - Look for retry sections in the log. **Do not mention any tests that passed on automatic retry** - we only care about tests that failed both attempts within the job.
      2. For persistent failures (failed both the initial attempt AND the automatic retry):
         - What failed and why? (error type, correlation IDs, specific error messages like 404s)
         - Likely cause: environment issue, flaky test, or genuine application problem?
         - Search https://gitlab.com/gitlab-org/quality/engineering-productivity for similar issues
      3. **Urgency:**
         - ⚠️ Persistent `qa-smoke` failure in staging-canary = DEPLOYMENT BLOCKER
         - Other persistent failures = Assess user impact

      **Recommended actions (in order):**
      1. **Retry the entire job first** (even persistent-within-job failures often pass on full job retry)
      2. **If still failing AND blocking deployment:**
         - If clearly a flaky/environment issue (not a real application bug): **Use fast-quarantine immediately** to unblock deployment
           - Link: https://gitlab.com/gitlab-org/quality/engineering-productivity/fast-quarantine
         - If genuine application issue that should not be released to customers: **create an incident** - DO NOT quarantine
      3. **If not blocking deployment but is causing too much failure noise:** Follow standard quarantine process

      **Do not mention:**
      - Tests that passed on automatic retry
      - Test case reporting output (test_case iid, Labels updated, etc.)

      Note: Distinguish application issues from test problems - don't quarantine real bugs.

      Provide issue links at the end.
   ```
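
The truncation note in step 3 suggests pasting the specific stack trace and error message into the prompt. One way to pull that excerpt out of a downloaded job log is sketched below in Python. This is a minimal sketch, not part of any GitLab tooling: it assumes the log contains RSpec-style `Failure/Error:` lines, and the function name `extract_failure_context` and the window sizes are hypothetical.

```python
from typing import List


def extract_failure_context(
    log_text: str,
    marker: str = "Failure/Error:",  # assumption: RSpec-style failure marker
    before: int = 5,
    after: int = 25,
) -> List[str]:
    """Return one excerpt per failure, merging windows that overlap."""
    lines = log_text.splitlines()
    windows = []  # list of [start, end) index pairs into `lines`
    for i, line in enumerate(lines):
        if marker in line:
            start = max(0, i - before)
            end = min(len(lines), i + after + 1)
            if windows and start <= windows[-1][1]:
                # Overlaps or touches the previous window: extend it instead.
                windows[-1][1] = max(windows[-1][1], end)
            else:
                windows.append([start, end])
    return ["\n".join(lines[s:e]) for s, e in windows]
```

Each returned excerpt can be pasted into the Duo prompt in place of the full log, so the relevant stack trace is not lost when the middle of a long log is removed.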

**✅ What to expect:**

- Duo helps distinguish real environment issues from flaky or broken tests
- It suggests fixes for test issues
- It identifies potential service or application issues for escalation
- It helps assess impact and urgency
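
The recommended actions in the prompt above form a small decision tree. The Python sketch below restates that order for illustration only. The names `Action`, `is_deployment_blocker`, and `recommend_action` are hypothetical, the boolean inputs are judgments a human (or Duo) still has to make, and the non-blocking branches follow the quarantine guidance in this guide rather than any automated tooling.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "passed on automatic retry; nothing to do"
    RETRY_JOB = "retry the entire job"
    FAST_QUARANTINE = "use fast-quarantine to unblock deployment"
    CREATE_INCIDENT = "create an incident; do not quarantine"
    ESCALATE = "escalate the application bug; do not quarantine"
    STANDARD_QUARANTINE = "follow the standard quarantine process"
    ASSESS_IMPACT = "assess user impact"


def is_deployment_blocker(job_name: str, pipeline: str) -> bool:
    """Smoke test failures in staging-canary block production deployments."""
    return "qa-smoke" in job_name and pipeline == "staging-canary"


def recommend_action(
    *,
    passed_on_auto_retry: bool,
    full_job_retried: bool,
    blocks_deployment: bool,
    genuine_app_issue: bool,
    noisy: bool = False,
) -> Action:
    if passed_on_auto_retry:
        return Action.IGNORE  # only persistent failures matter
    if not full_job_retried:
        return Action.RETRY_JOB  # always the first step
    if genuine_app_issue:
        # Real bugs are never quarantined.
        return Action.CREATE_INCIDENT if blocks_deployment else Action.ESCALATE
    if blocks_deployment:
        return Action.FAST_QUARANTINE  # flaky or environment issue
    if noisy:
        return Action.STANDARD_QUARANTINE
    return Action.ASSESS_IMPACT
```

For example, a persistent `qa-smoke` failure in staging-canary that survives a full job retry and looks flaky maps to `Action.FAST_QUARANTINE`, while the same failure caused by a real application bug maps to `Action.CREATE_INCIDENT`.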

### ⚠️ When Duo Can't Help

If Duo's analysis does not resolve your issue:

1. **💻 Try reproducing locally** (~10 minutes):
    - Log in to the environment manually and try to reproduce the test case
    - Execute the test against the target environment by using credentials from 1Password
2. **🔍 Cross-check with related systems**:
    - Check [**#incident-management**](https://gitlab.enterprise.slack.com/archives/CB7P5CJS1) Slack channel for recent incidents
    - Check [GitLab.com status page](https://status.gitlab.com) for known incidents
    - Review recent deployments that might correlate with the failure
    - Look for patterns across multiple environment pipelines
3. **🚧 Quarantine if needed:**
    - **For urgent deployment-blocking smoke tests:** Use [fast-quarantine](https://gitlab.com/gitlab-org/quality/engineering-productivity/fast-quarantine) to immediately unblock deployments
    - **For non-urgent test issues:** Follow the [Test Quarantine Process](quarantine-process.md)
    - **Important:** Only quarantine test issues, not genuine application bugs - escalate those instead

---

## 📚 Related Resources

- [Testing Guide](_index.md) - Complete testing overview
- [GitLab Testing Guide](https://docs.gitlab.com/development/testing_guide) - Technical implementation details
- [Detailed Quarantine Process](./quarantine-process.md)) - How to quarantine tests
- [Test Quarantine Process](quarantine-process.md) - How to quarantine tests
- [Guide to E2E Test Failure Issues](guide-to-e2e-test-failure-issues.md) - Product engineer debugging guide

**Need help?** Reach out in [**#s_developer_experience**](https://gitlab.enterprise.slack.com/archives/C07TWBRER7H) or create a [Request for Help issue](https://gitlab.com/gitlab-org/quality/test-governance/request-for-help/-/issues/new)