Decide on long-term approach for ref_existence_check_gitaly feature flag

Context

This is a follow-up to #556727 (closed) to decide the long-term approach for the ref_existence_check_gitaly feature flag.

The feature flag has been fully rolled out to production and is currently enabled. However, the rollout raised significant concerns about whether this approach is sustainable long-term.

Rollout Findings

During the rollout to 50%, we observed:

  • A 5x increase in ListRefs calls to Gitaly
  • At 100% rollout, this corresponds to approximately a 10x increase in ListRefs calls compared to the cached version
  • Disabling and re-enabling the flag reproduced the pattern, confirming that the load increase is directly tied to this feature

See epic discussion for metrics and graphs.
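
As a rough consistency check on the 5x and 10x figures, the sketch below assumes that the share of ref-existence checks taking the uncached path equals the rollout percentage, and that each uncached check generates a fixed multiple of the ListRefs calls of a cached one. Both are simplifying assumptions rather than measurements, and the function names are illustrative only.

```python
def uncached_multiplier(observed_increase: float, rollout_fraction: float) -> float:
    """Solve (1 - f) * 1 + f * m = observed_increase for m.

    f is the fraction of traffic on the uncached path and m is the
    per-request call multiplier of that path relative to the cached one.
    """
    if not 0 < rollout_fraction <= 1:
        raise ValueError("rollout_fraction must be in (0, 1]")
    return (observed_increase - (1 - rollout_fraction)) / rollout_fraction


def projected_increase(multiplier: float, rollout_fraction: float) -> float:
    """Overall call increase, relative to the fully cached baseline."""
    return (1 - rollout_fraction) + rollout_fraction * multiplier


# A 5x overall increase observed at 50% rollout.
m = uncached_multiplier(5.0, 0.5)
print(m)                            # 9.0
print(projected_increase(m, 1.0))   # 9.0
```

Under this simple model a 5x increase at 50% implies about a 9x increase at 100%, which is in line with the "approximately 10x" figure.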

Key Concerns

  1. Gitaly Load Impact: The 10x increase in ListRefs calls is significantly higher than initially expected
  2. Self-Managed Instances: While GitLab.com infrastructure can handle this load, customer environments may have:
    • Less memory for filesystem caching
    • Higher disk read latency and lower IOPS capacity
    • Different workload patterns that amplify the impact
  3. Sustainability: The fact that GitLab.com infrastructure can handle the load today does not make this the optimal long-term solution

Options to Consider

Option A: Keep current approach (cache removal)

  • Pro: Feature flag is already rolled out to 100%
  • Pro: Solves cache consistency issues (#539287, #572341 (closed))
  • Pro: Simpler codebase without cache complexity
  • Con: 10x increase in Gitaly load
  • Con: Potential performance issues for self-managed instances
  • Next step: Clean up feature flag and remove caching code

Option B: Revert and explore incremental cache approach

  • Pro: Avoids the 10x Gitaly load increase
  • Pro: Reduces pressure on Redis
  • Pro: Better for self-managed instances with limited resources
  • Con: Might introduce hard-to-debug cache consistency issues
  • Con: More complex implementation
  • Next step: Disable feature flag, investigate #567993

Option C: Hybrid approach

  • Keep feature flag but don't enable by default for self-managed
  • Allow GitLab.com to use direct Gitaly calls
  • Allow self-managed instances to opt in if they have sufficient resources
  • Investigate incremental cache improvements for default behavior
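
To make the shape of Option C concrete, here is a minimal sketch of a flag-gated lookup. It is written in Python for illustration only and does not reflect the actual GitLab code; `RefExistenceChecker`, `direct_lookup`, `cached_lookup`, and `flag_enabled` are all hypothetical names standing in for the direct Gitaly call, the cached path, and the feature-flag check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RefExistenceChecker:
    direct_lookup: Callable[[str], bool]  # hypothetical: direct Gitaly call
    cached_lookup: Callable[[str], bool]  # hypothetical: cached path
    flag_enabled: Callable[[], bool]      # hypothetical: feature-flag check

    def exists(self, ref: str) -> bool:
        # Flag on (e.g. GitLab.com or an opted-in instance): always ask the backend.
        if self.flag_enabled():
            return self.direct_lookup(ref)
        # Flag off (the proposed self-managed default): use the cached answer.
        return self.cached_lookup(ref)
```

The cost of this design is that both code paths must be maintained and tested for as long as the flag exists.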

Questions to Answer

  1. Do we have data on self-managed instance performance with this change?
  2. What is the acceptable threshold for Gitaly load increase?
  3. Can we implement the incremental cache approach (#567993) without introducing cache consistency bugs?
  4. Should we consider a TTL-based approach with shorter expiration windows?
  5. What is the long-term vision for reference caching in GitLab?
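
For question 4, the trade-off of a TTL-based approach can be sketched as follows. This is an illustrative in-memory model, not the existing cache implementation: `TtlRefCache`, the `lookup` callable (standing in for a Gitaly call), and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class TtlRefCache:
    """Caches ref-existence answers for a fixed number of seconds."""

    def __init__(
        self,
        lookup: Callable[[str], bool],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup          # hypothetical backend call
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[bool, float]] = {}
        self.backend_calls = 0         # counts how often the backend is hit

    def exists(self, ref: str) -> bool:
        now = self._clock()
        entry = self._entries.get(ref)
        if entry is not None and now < entry[1]:
            # Entry is still fresh: answer from cache, possibly stale.
            return entry[0]
        # Missing or expired: ask the backend and store the new answer.
        self.backend_calls += 1
        value = self._lookup(ref)
        self._entries[ref] = (value, now + self._ttl)
        return value
```

A shorter TTL bounds how long a deleted or newly created ref can be reported incorrectly, at the cost of more backend calls; a longer TTL does the opposite. Unlike explicit invalidation, this approach accepts a bounded staleness window rather than trying to eliminate it.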

Related Issues

Decision

To be filled in after discussion

Next Steps

  • Review rollout metrics and impact analysis
  • Discuss options with team and stakeholders
  • Make decision on long-term approach
  • Create implementation issue(s) based on decision
  • Update #556727 (closed) with decision and next steps