Proposal: Remove gitlab-org Prefix Routing to cny Stage
Current Routing Configuration
Our HAProxy routing currently implements these rules in order:
- Internal prefix rules: All requests to gitlab-org/*, gitlab-com/*, and several other internal prefixes → route to cny stage
- Random sampling: For remaining traffic, ~10% → cny stage, ~90% → main stage
This results in cny receiving:
- 100% of traffic from internal GitLab repositories (including our busiest repo, gitlab-org/gitlab)
- ~10% of all other repository traffic
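The ordered rules above can be sketched as a small decision function. This is an illustrative Python model, not the actual HAProxy configuration: the function name `route_current`, the `draw` parameter, and the constant names are assumptions made for the example, and only the two prefixes named in this proposal are included.

```python
import random

# Prefixes named in the proposal. The "several other internal prefixes"
# are not listed there, so they are intentionally left out of this sketch.
INTERNAL_PREFIXES = ("gitlab-org/", "gitlab-com/")
CNY_SAMPLE_RATE = 0.10  # the "~10%" figure from the proposal


def route_current(project_path, draw=None):
    """Return "cny" or "main" for a request under the current rules.

    `draw` is a number in [0, 1); it is injectable so the sampling
    branch can be exercised deterministically (an assumption of this sketch).
    """
    # Rule 1: internal prefixes always go to cny.
    if project_path.startswith(INTERNAL_PREFIXES):
        return "cny"
    # Rule 2: all remaining traffic is sampled at roughly 10%.
    if draw is None:
        draw = random.random()
    return "cny" if draw < CNY_SAMPLE_RATE else "main"
```

Because rule 1 is evaluated first, a request for gitlab-org/gitlab never reaches the sampling branch, whatever the draw.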
Problem
This configuration creates unrepresentative traffic patterns on cny. The gitlab-org/gitlab repository alone generates massive load, and combined with other internal repositories being routed to cny, the stage receives disproportionately high load from a small subset of repositories. This skewed traffic distribution generates false positive alerts that create significant burden for the on-call team.
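To make the skew concrete, the arithmetic can be written out. If a fraction f of all requests targets internal prefixes, cny receives f + 0.10 × (1 − f) of total requests under the current rules. The sketch below is a simplified model that counts requests rather than load; the function names and the 20% figure used afterwards are hypothetical, not measured values.

```python
def cny_share(internal_fraction, sample_rate=0.10):
    """Fraction of all requests that land on cny under the current rules."""
    if not 0.0 <= internal_fraction <= 1.0:
        raise ValueError("internal_fraction must be between 0 and 1")
    # Internal traffic goes to cny in full; the remainder is sampled.
    return internal_fraction + sample_rate * (1.0 - internal_fraction)


def internal_share_of_cny(internal_fraction, sample_rate=0.10):
    """Fraction of cny's own traffic that comes from internal prefixes."""
    total = cny_share(internal_fraction, sample_rate)
    return internal_fraction / total if total > 0 else 0.0
```

With a hypothetical internal fraction of 0.2, `cny_share(0.2)` is about 0.28 and `internal_share_of_cny(0.2)` is about 0.71. In that scenario internal repositories would make up roughly 71% of cny traffic while being only 20% of overall traffic.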
Solution
Remove the prefix-based routing rules for gitlab-org, gitlab-com, and other internal prefixes. Simplify to a single routing rule: ~10% random sampling of all traffic to cny, ~90% to main.
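A minimal Python sketch of the simplified rule, again as an illustrative model rather than the real HAProxy change. The names `route_proposed` and `simulated_cny_share`, and the sample project paths in the usage note, are assumptions made for the example.

```python
import random

CNY_SAMPLE_RATE = 0.10  # the "~10%" figure from the proposal


def route_proposed(project_path, draw=None):
    """Return "cny" or "main"; the project path no longer affects the result."""
    if draw is None:
        draw = random.random()
    return "cny" if draw < CNY_SAMPLE_RATE else "main"


def simulated_cny_share(paths, seed=0):
    """Route each path once with a seeded generator and return cny's share."""
    rng = random.Random(seed)
    hits = sum(route_proposed(path, rng.random()) == "cny" for path in paths)
    return hits / len(paths)
```

Feeding this an even mix of gitlab-org/gitlab and external paths should yield a cny share close to 10% whatever the mix, which is the property the proposal is after: cny's composition follows the composition of overall traffic.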
Benefits
- Eliminates false positives: cny traffic patterns will mirror main stage proportionally, reducing alert noise
- Reduces on-call burden: Fewer spurious alerts from unrepresentative load patterns
- Improves traffic representation: cny becomes a true representative sample of production load
- Maintains testing capability: We retain the ability to drain cny and shift traffic to main
- Simplifies routing logic: Single routing rule instead of multiple special cases
- No meaningful loss: The current internal user testing on cny via prefix routing is not serving us well
Implementation
- Change: Update HAProxy configuration to remove only gitlab-org/gitlab from cny routing, allowing it to follow the ~10% random sampling rule like other repositories
- Observation Period: Wait 2 weeks to collect post-change data on cny alert and incident volume
- Assessment: Review metrics comparing pre/post change:
  - cny alert frequency and false positive rate
  - On-call burden and incident response time
  - Traffic distribution patterns
- Decision Point:
  - If impact is positive → proceed to remove all remaining prefix rules (gitlab-com/*, etc.)
  - If neutral → consider keeping current state or proceeding cautiously
  - If negative → roll back immediately
- Rollback Capability: Maintain ability to quickly restore previous HAProxy routing configuration at any stage
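The decision point could be made mechanical with a small helper like the one below. The proposal does not define what counts as positive, neutral, or negative, so the 10% `tolerance` and the function name `classify_impact` are hypothetical placeholders that would need to be agreed on before the observation period starts.

```python
def classify_impact(before, after, tolerance=0.10):
    """Classify the change in a lower-is-better metric.

    `before` and `after` are the same metric measured over equal-length
    windows, for example false positive cny alerts per week.
    `tolerance` is a hypothetical threshold, not a value from the proposal.
    """
    if before <= 0:
        raise ValueError("before must be positive to compute a relative change")
    change = (after - before) / before
    if change <= -tolerance:
        return "positive"   # metric dropped: proceed with remaining prefixes
    if change >= tolerance:
        return "negative"   # metric rose: roll back
    return "neutral"        # within tolerance: keep state or proceed cautiously
```

For instance, going from 40 to 20 false positive alerts per week would classify as "positive", while 40 to 41 would be "neutral" under this assumed tolerance.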
Risk Assessment
Low risk: Phased approach with clear measurement criteria and immediate rollback capability. Starting with the single most problematic repository minimizes potential impact while providing valuable data.