
Improve automated quarantine process to ensure flaky tests are handled in time

Problem Statement

The current automated quarantine process has several critical issues that prevent flaky tests from being properly addressed:

  1. Unassigned Quarantine MRs: Some automated quarantine MRs are not assigned to the appropriate group/team, so they stay open for 3+ months without review
  2. Lack of Resolution Context: Quarantine MRs are being closed without clear documentation of whether the tests were fixed, removed, or are still flaky
  3. Ownership Ambiguity: Some tests have 'shared' responsibility, which makes it unclear which team should address them when they become flaky
  4. Process Effectiveness Unknown: Without active team involvement, we cannot evaluate if the auto-quarantine process is too aggressive or not aggressive enough

Current State

Based on epic findings:

  • 101 automated quarantine MRs were opened more than 1 month ago and have had no update since
  • 420+ tests with shared ownership
  • 10+ tests in quarantine with shared ownership

Proposed Improvements

1. Fix Auto-Assignment of Quarantine MRs

  • Ensure all automated quarantine MRs are assigned to the correct group based on the test's feature_category metadata
  • Implement validation to prevent MRs from being created without proper assignment
  • Add fallback assignment logic for tests without clear ownership
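
The assignment, validation, and fallback steps above could be sketched roughly as follows. This is a minimal illustration, not the existing tooling: the category names, group names, fallback group, and the set of values treated as "unowned" are all hypothetical placeholders.

```python
from typing import Optional

# Hypothetical mapping from feature_category to owning group; real values
# would come from the project's own category definitions.
CATEGORY_TO_GROUP = {
    "example_category_a": "group::example-a",
    "example_category_b": "group::example-b",
}

# Hypothetical fallback owner for tests without clear ownership.
FALLBACK_GROUP = "group::example-triage"

# Category values treated as "no clear owner" in this sketch (assumption).
UNOWNED_CATEGORIES = {"shared", ""}


def resolve_group(feature_category: Optional[str]) -> str:
    """Return the group to assign, using the fallback when ownership is unclear."""
    category = (feature_category or "").strip()
    if category in UNOWNED_CATEGORIES:
        return FALLBACK_GROUP
    # Unknown categories also fall back rather than leaving the MR unassigned.
    return CATEGORY_TO_GROUP.get(category, FALLBACK_GROUP)


def validate_assignment(group: Optional[str]) -> None:
    """Raise before MR creation if no group was resolved."""
    if not group:
        raise ValueError("quarantine MR must be assigned to a group before creation")
```

Routing unknown or 'shared' categories to a single fallback group keeps every MR visible to someone, at the cost of that group needing to re-route them.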

2. Enforce Resolution Documentation

  • Require quarantine MRs to include resolution context before closing:
    • Test was fixed (link to fix MR)
    • Test was removed (justification for removal)
    • Test remains quarantined (reason and follow-up plan)
  • Add MR template for quarantine MRs with required fields
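
One way to enforce the required fields is a small check that runs before a quarantine MR is closed. The sketch below assumes the resolution is captured as an outcome plus a free-text detail; the `Resolution` type and the outcome keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

# The three outcomes listed above, with the detail each one requires.
REQUIRED_DETAIL = {
    "fixed": "link to fix MR",
    "removed": "justification for removal",
    "still_quarantined": "reason and follow-up plan",
}


@dataclass
class Resolution:
    outcome: str
    detail: str = ""


def resolution_errors(resolution: Resolution) -> List[str]:
    """Return the problems that should block closing; empty means OK to close."""
    errors = []
    if resolution.outcome not in REQUIRED_DETAIL:
        errors.append(
            f"unknown outcome {resolution.outcome!r}; "
            f"expected one of {sorted(REQUIRED_DETAIL)}"
        )
    elif not resolution.detail.strip():
        errors.append(
            f"outcome {resolution.outcome!r} requires: "
            f"{REQUIRED_DETAIL[resolution.outcome]}"
        )
    return errors
```

Returning a list of messages rather than raising lets the caller post all problems back to the MR in a single comment.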

3. Eliminate 'Shared' Test Ownership

  • Audit all tests with 'shared' responsibility
  • Work with teams to assign clear ownership based on feature area
  • Update test metadata to reflect proper ownership
  • Document ownership assignment guidelines in the handbook
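
The audit step could start from a simple report that separates 'shared' tests already in quarantine from the rest, so the quarantined ones are reassigned first. This sketch assumes each test is described by a dict with `file`, `feature_category`, and `quarantined` keys, and that shared ownership appears as the literal value "shared"; both are assumptions about the metadata shape.

```python
from typing import Dict, List


def audit_shared_tests(tests: List[Dict]) -> Dict[str, List[str]]:
    """Split tests marked 'shared' into quarantined and not quarantined."""
    report: Dict[str, List[str]] = {"quarantined": [], "not_quarantined": []}
    for test in tests:
        if test.get("feature_category") != "shared":
            continue  # already has a specific owner
        key = "quarantined" if test.get("quarantined") else "not_quarantined"
        report[key].append(test["file"])
    return report
```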

4. Establish Team Engagement Process

  • Send automated notifications to team leads when quarantine MRs are created
  • Include quarantine metrics in milestone planning
  • Create dashboards showing quarantine MRs by team
  • Set SLAs for quarantine MR review (e.g., within 7 days)
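
A per-team dashboard or notification job could be driven by a check like the one below, which lists unreviewed quarantine MRs that have exceeded the review SLA. It uses the 7-day example from the list above; the MR record shape (`id`, `team`, `created_at`, `reviewed_at`) is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List

# Review SLA taken from the 7-day example above; adjust as agreed.
REVIEW_SLA = timedelta(days=7)


def overdue_by_team(mrs: List[Dict], now: datetime) -> Dict[str, List[int]]:
    """Group IDs of unreviewed MRs older than the SLA by owning team."""
    overdue = defaultdict(list)
    for mr in mrs:
        unreviewed = mr["reviewed_at"] is None
        if unreviewed and now - mr["created_at"] > REVIEW_SLA:
            overdue[mr["team"]].append(mr["id"])
    return dict(overdue)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.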

5. Improve Auto-Quarantine Intelligence

  • Once teams are actively engaged, analyze:
    • False positive rate (tests quarantined that aren't actually flaky)
    • Miss rate (flaky tests not caught by auto-quarantine)
    • Optimal failure threshold before quarantine
  • Adjust quarantine criteria based on findings
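
Once teams confirm which quarantined tests were really flaky, the two rates above can be computed from set differences. In this sketch the false positive rate is taken as the share of quarantined tests that were confirmed not flaky, and the miss rate as the share of confirmed flaky tests that were never quarantined; these definitions and the input sets are assumptions for illustration.

```python
from typing import Dict, Set


def quarantine_rates(quarantined: Set[str], confirmed_flaky: Set[str]) -> Dict[str, float]:
    """Compute false positive and miss rates for the auto-quarantine process."""
    false_positives = quarantined - confirmed_flaky  # quarantined, not flaky
    misses = confirmed_flaky - quarantined           # flaky, never quarantined
    fp_rate = len(false_positives) / len(quarantined) if quarantined else 0.0
    miss_rate = len(misses) / len(confirmed_flaky) if confirmed_flaky else 0.0
    return {"false_positive_rate": fp_rate, "miss_rate": miss_rate}
```

A high false positive rate would suggest raising the failure threshold before quarantine, while a high miss rate would suggest lowering it.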

Success Criteria

  • 100% of new automated quarantine MRs have a team assignment
  • 0 tests with 'shared' ownership in quarantine
  • Average time to review quarantine MRs < 5 days
  • All closed quarantine MRs have resolution context
Edited by Chloe Liu