Proposal: Remove gitlab-org Prefix Routing to cny Stage

Current Routing Configuration

Our HAProxy routing currently implements these rules in order:

  1. Internal prefix rules: All requests to gitlab-org/*, gitlab-com/*, and several other internal prefixes → route to cny stage
  2. Random sampling: For remaining traffic, ~10% → cny stage, ~90% → main stage
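The two rules above can be sketched as a small routing function. This is only an illustration of the rule ordering, not the actual HAProxy configuration: the prefix tuple is a hypothetical stand-in (the real list includes other internal prefixes not named here), and `route` and `draw` are names introduced for this sketch.

```python
import random

# Hypothetical prefix list: the proposal names gitlab-org/ and gitlab-com/ and
# mentions several other internal prefixes that are not listed here.
INTERNAL_PREFIXES = ("gitlab-org/", "gitlab-com/")
CNY_SAMPLE_RATE = 0.10  # the "~10%" figure from the rules above


def route(project_path: str, draw: float) -> str:
    """Return "cny" or "main" for a request, applying the two rules in order.

    `draw` is a uniform random number in [0, 1); passing it in keeps the
    function deterministic and easy to test.
    """
    # Rule 1: internal prefixes always go to cny.
    if project_path.startswith(INTERNAL_PREFIXES):
        return "cny"
    # Rule 2: all remaining traffic is sampled at ~10%.
    return "cny" if draw < CNY_SAMPLE_RATE else "main"


if __name__ == "__main__":
    print(route("gitlab-org/gitlab", random.random()))        # always "cny"
    print(route("some-group/some-project", random.random()))  # "cny" about 10% of the time
```

Because rule 1 is checked first, the random draw never matters for internal repositories.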

This results in cny receiving:

  • 100% of traffic from internal GitLab repositories (including our busiest repo gitlab-org/gitlab)
  • ~10% of all other repository traffic

Problem

This configuration makes cny traffic unrepresentative of production. The gitlab-org/gitlab repository alone generates a large share of load, and with the other internal repositories also pinned to cny, the stage receives disproportionately high load from a small subset of repositories. This skewed distribution triggers false-positive alerts that place a significant burden on the on-call team.
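The size of the skew can be estimated with simple arithmetic. In the sketch below, `internal_fraction` is the share of all requests that match an internal prefix; the 20% used in the example is a made-up illustrative value, not a measurement, and both function names are introduced only for this sketch.

```python
def cny_share(internal_fraction: float, sample_rate: float = 0.10) -> float:
    """Fraction of all requests that reach cny under the current rules."""
    # Internal traffic goes entirely to cny; the remainder is sampled.
    return internal_fraction + sample_rate * (1.0 - internal_fraction)


def internal_share_of_cny(internal_fraction: float, sample_rate: float = 0.10) -> float:
    """Fraction of cny's requests that come from internal prefixes."""
    return internal_fraction / cny_share(internal_fraction, sample_rate)


if __name__ == "__main__":
    f = 0.20  # hypothetical value for illustration, not a measured figure
    print(f"cny receives {cny_share(f):.0%} of all traffic")
    print(f"internal repos make up {internal_share_of_cny(f):.0%} of cny traffic")
```

Under that illustrative 20% assumption, cny would receive about 28% of all traffic instead of ~10%, and roughly 71% of its requests would come from internal repositories.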

Solution

Remove the prefix-based routing rules for gitlab-org, gitlab-com, and other internal prefixes. Simplify to a single routing rule: ~10% random sampling of all traffic to cny, ~90% to main.
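A minimal sketch of the simplified rule, with a small simulation of why uniform sampling makes cny's traffic mix track overall traffic. The function names, the seed, and the internal-traffic fraction are assumptions made for illustration; real traffic is not drawn from this simple model.

```python
import random

CNY_SAMPLE_RATE = 0.10  # the "~10%" figure from the proposal


def route(draw: float) -> str:
    """Single rule: sample ~10% of all requests to cny, regardless of path."""
    return "cny" if draw < CNY_SAMPLE_RATE else "main"


def internal_share_on_cny(n_requests: int, internal_fraction: float, seed: int = 0) -> float:
    """Simulate requests and return the internal share of cny's traffic."""
    rng = random.Random(seed)
    cny_total = 0
    cny_internal = 0
    for _ in range(n_requests):
        # Whether a request is internal no longer influences routing.
        is_internal = rng.random() < internal_fraction
        if route(rng.random()) == "cny":
            cny_total += 1
            cny_internal += is_internal
    return cny_internal / cny_total if cny_total else 0.0


if __name__ == "__main__":
    # 0.20 is a hypothetical internal-traffic fraction, not a measured value.
    print(f"{internal_share_on_cny(100_000, 0.20):.1%}")
```

With enough requests, the internal share on cny should land close to whatever the internal share of overall traffic is, which is the "representative sample" property the proposal aims for.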

Benefits

  • Eliminates false positives: cny traffic patterns will mirror main stage proportionally, reducing alert noise
  • Reduces on-call burden: Fewer spurious alerts from unrepresentative load patterns
  • Improves traffic representation: cny becomes a true representative sample of production load
  • Maintains testing capability: We retain the ability to drain cny and shift traffic to main
  • Simplifies routing logic: Single routing rule instead of multiple special cases
  • No meaningful loss: The current internal user testing on cny via prefix routing is not serving us well

Implementation

  1. Initial change: Update the HAProxy configuration so that gitlab-org/gitlab is no longer force-routed to cny and instead follows the ~10% random sampling rule like other repositories. All other internal prefix rules stay in place for now.
  2. Observation period: Wait 2 weeks to collect post-change data on cny alert and incident volume, for comparison against the pre-change baseline
  3. Assessment: Review metrics comparing pre/post change:
    • cny alert frequency and false positive rate
    • On-call burden and incident response time
    • Traffic distribution patterns
  4. Decision Point:
    • If impact is positive → proceed to remove all remaining prefix rules (gitlab-com/*, etc.)
    • If neutral → consider keeping current state or proceeding cautiously
    • If negative → roll back immediately
  5. Rollback Capability: Maintain ability to quickly restore previous HAProxy routing configuration at any stage

Risk Assessment

Low risk: Phased approach with clear measurement criteria and immediate rollback capability. Starting with the single most problematic repository minimizes potential impact while providing valuable data.