Verified commit 80ac67cc authored by Jay McCure

Align quarantine and flaky test timelines

parent a18a08b4
+1 −1
@@ -57,7 +57,7 @@ For guidance on quarantining tests, see the [Quarantine Process (Handbook)](../q

 Flaky tests are categorized by urgency based on their impact on pipeline stability:

-- 🔴 **Critical**: 48 hours - Tests blocking critical workflows or affecting multiple teams
+- 🔴 **Critical**: 48 hours - Tests blocking critical workflows, deployment pipelines or affecting multiple teams
 - 🟠 **High**: 1 week - Tests with significant pipeline impact
 - 🟡 **Medium**: 2 weeks - Tests with moderate impact
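
Reviewer sketch (not part of the commit): the tiers in this hunk map an urgency level to a response window, which can be expressed as a small lookup. This is a minimal illustration, assuming only the three tiers visible here; `RESPONSE_WINDOWS` and `response_deadline` are hypothetical names and do not refer to any existing GitLab tooling.

```python
from datetime import datetime, timedelta

# Response windows copied from the tier list in this hunk.
# Hypothetical names for illustration only; tiers not shown in the hunk are omitted.
RESPONSE_WINDOWS = {
    "critical": timedelta(hours=48),
    "high": timedelta(weeks=1),
    "medium": timedelta(weeks=2),
}


def response_deadline(tier: str, detected_at: datetime) -> datetime:
    """Return the latest time by which a flaky test in `tier` should be handled."""
    window = RESPONSE_WINDOWS.get(tier.lower())
    if window is None:
        raise ValueError(f"unknown urgency tier: {tier!r}")
    return detected_at + window
```

For example, `response_deadline("Critical", detected_at)` adds 48 hours to the detection time, while an unrecognized tier name raises `ValueError` instead of silently picking a default.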

+24 −57
@@ -10,9 +10,9 @@ The quarantine process helps maintain pipeline stability by temporarily removing
 ## 🗺️ Quick navigation

 | Scenario                                  | What to do                                                                |
-|-------------------------------------------|--------------------------------------------------------------------------------------|
-| 📬 **I've been assigned a quarantine MR** | [Review and decide within 48 hours](#youve-been-assigned-a-quarantine-merge-request) |
-| 🚨 **I need to quarantine a test**        | [Quarantine types](#quarantine-types)                                                |
+|-------------------------------------------|---------------------------------------------------------------------------|
+| 📬 **I've been assigned a quarantine MR** | [Review and take action](#youve-been-assigned-a-quarantine-merge-request) |
+| 🚨 **I need to quarantine a test**        | [Quarantine Lifecycle](#quarantine-lifecycle)                             |
 | ✅ **I want to dequarantine a test**      | [Follow the dequarantine process](#dequarantine-a-test)                   |
 | 🔍 **How do tests get quarantined?**      | [Automated detection & manual process](#how-tests-become-quarantined)     |
 | 📊 **Where's the data?**                  | [Flaky test detection and tracking](#flaky-test-detection-and-tracking)   |
@@ -20,26 +20,23 @@ The quarantine process helps maintain pipeline stability by temporarily removing

 ## You've been assigned a quarantine merge request

-If you have been assigned a quarantine merge request, you have been identified as an appropriate DRI to review the quarantine for the failing test.
+If you have been assigned a quarantine merge request, you have been identified as an appropriate DRI to review the quarantine for the test. These merge requests are created automatically for [consistently failing flaky tests](./flaky-tests/_index.md), but can also be created manually by team members.

 **What to do:**

-1. **Review within 48 hours**.
-1. **Decide the best approach**:
-    - **Fix immediately**: If the root cause is clear and fixable within 2 weeks.
+1. **Review and decide the best approach**:
+    - **Fix immediately**: If the root cause is clear and a fix is known.
     - **Delete the test**: If it's low-value or redundant.
     - **Convert to lower level**: If it can be tested more reliably at unit or integration level.
-    - **Quarantine**: If investigation takes longer than 2 weeks.
-1. **Take action within 2 weeks**:
+    - **Quarantine**: If investigation or fix will take longer than the required response time (see [Flaky tests - Urgency Tiers and Response Timelines](./flaky-tests/_index.md#urgency-tiers-and-response-timelines) for specific timelines).
+1. **Take action according to the response timelines** (see [Flaky tests - Urgency Tiers and Response Timelines](./flaky-tests/_index.md#urgency-tiers-and-response-timelines) for specific timelines):
     - Merge the merge request (test enters quarantine), OR
     - Fix the test and close the merge request, OR
     - Provide feedback on why it shouldn't be quarantined.

-**Note:** For flaky tests, urgency tiers may suggest different timelines. See [Flaky Tests handbook](./flaky-tests/) for details.
-
 **If no action is taken:**

-- The merge request is approved by the Pipeline DRI after 2 weeks.
+- The merge request is approved by the Pipeline DRI after the urgency timeline expires.
 - The test enters long-term quarantine.
 - The 3-month deletion countdown begins.

@@ -93,11 +90,20 @@ For complete information about how flaky tests are detected, tracked, and report

 These systems provide the foundation for the quarantine process by identifying which tests need attention.

-## Quarantine types
+## Quarantine lifecycle

 Choose the right quarantine type based on urgency and your ability to resolve the test failure.
 Fast quarantine uses a separate file in a dedicated repository for rapid merging, while long-term quarantine modifies test metadata directly in the GitLab codebase.

+### Quarantine phase durations
+
+| Phase                | Duration         | Action required                      |
+|----------------------|------------------|--------------------------------------|
+| Fast quarantine      | 3 days maximum   | Fix, remove, or convert to long-term |
+| Long-term quarantine | 3 months maximum | Investigation and resolution         |
+| Deletion warning     | 1 week           | Final opportunity to resolve         |
+| Automatic deletion   | After 3 months   | Test permanently removed             |
+
 ### Fast quarantine

 Use fast quarantine when:
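
Reviewer sketch (not part of the commit): the phase durations in the table above can be turned into concrete dates. This is a rough illustration under stated assumptions: "3 months" is approximated as 90 days, and the one-week deletion warning is assumed to fall in the final week before deletion. All names are hypothetical and do not correspond to the Quarantine Notification System or Quarantine Cleanup System.

```python
from dataclasses import dataclass
from datetime import date, timedelta

FAST_QUARANTINE_MAX = timedelta(days=3)   # "3 days maximum"
LONG_TERM_MAX = timedelta(days=90)        # assumption: "3 months" treated as 90 days
DELETION_WARNING = timedelta(weeks=1)     # assumption: warning sent one week before deletion


@dataclass(frozen=True)
class LongTermTimeline:
    warning_on: date   # final opportunity to resolve
    delete_on: date    # test becomes eligible for removal


def fast_quarantine_deadline(started: date) -> date:
    """Date by which a fast-quarantined test must be fixed, removed, or converted."""
    return started + FAST_QUARANTINE_MAX


def long_term_timeline(started: date) -> LongTermTimeline:
    """Warning and deletion dates for a test entering long-term quarantine on `started`."""
    delete_on = started + LONG_TERM_MAX
    return LongTermTimeline(warning_on=delete_on - DELETION_WARNING, delete_on=delete_on)
```

Keeping the durations as named constants makes it easy to adjust the sketch if the handbook defines "3 months" by milestones rather than calendar days.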
@@ -152,45 +158,6 @@ To use long-term quarantine, create a quarantine merge request with the appropri

 The maximum duration for long-term quarantine is 3 months (3 milestones or releases). The owner or Engineering Manager is alerted by the Quarantine Notification System that the test is to be deleted. The Quarantine Cleanup System then creates the deletion merge request to be actioned by the Pipeline DRI within a week.

-## Quarantine lifecycle
-
-### Timeline expectations
-
-| Phase                | Duration         | Action required                      |
-|----------------------|------------------|--------------------------------------|
-| Fast quarantine      | 3 days           | Fix, remove, or convert to long-term |
-| Long-term quarantine | 3 months maximum | Investigation and resolution         |
-| Deletion warning     | 1 week           | Final opportunity to resolve         |
-| Automatic deletion   | After 3 months   | Test permanently removed             |
-
-### Automated quarantine merge requests
-
-The process creates automated quarantine merge requests for consistently failing flaky tests.
-
-When you're assigned a quarantine merge request:
-
-1. Review it within 48 hours (optional for test changes, mandatory for test framework changes).
-1. Decide the best approach:
-    - **Fix immediately**: If the root cause is clear and fixable within 2 weeks.
-    - **Delete the test**: If it's low-value or redundant.
-    - **Convert to lower level**: If it can be tested more reliably at unit or integration level.
-    - **Quarantine**: If investigation takes longer than 2 weeks.
-1. Merge it or provide feedback within 2 weeks.
-1. Create a follow-up issue for investigation or fix.
-1. Update the issue when the fix is scheduled.
-
-If you don't take action:
-
-- The merge request is approved by the Pipeline DRI after 2 weeks.
-- The test enters long-term quarantine.
-- The 3-month deletion timeline begins.
-
-Throughout the quarantine period, you receive:
-
-- **Weekly reminders**: To encourage progress updates and resolution.
-- **One week before deletion**: Final warning notification.
-- **At deletion**: Confirmation that the test has been removed.
-
 ## Dequarantine a test

 ### Prerequisites