Decide on long-term approach for ref_existence_check_gitaly feature flag
Context
This is a follow-up to #556727 (closed) to decide the long-term approach for the ref_existence_check_gitaly feature flag.
The feature flag has been fully rolled out to production and is currently enabled. However, the rollout raised significant concerns about whether this approach is sustainable long term.
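For context, here is a minimal sketch of what the flag gates, assuming the usual `Feature.enabled?` gating in the GitLab Rails codebase. The method and cache names (`ref_exists?`, `raw_repository.ref_exists?`, `ref_names_from_cache`) are illustrative placeholders, not the actual implementation:

```ruby
# Illustrative sketch only: helper names below are assumptions, not the real code paths.
def ref_exists?(repository, ref_name)
  if Feature.enabled?(:ref_existence_check_gitaly, repository.project)
    # Flag on: ask Gitaly directly on every check (the ListRefs traffic
    # observed during rollout), with no cache that can go stale.
    repository.raw_repository.ref_exists?(ref_name)
  else
    # Flag off: serve the check from the Redis-backed ref cache and rely on
    # explicit invalidation, which is where the consistency bugs came from.
    repository.ref_names_from_cache.include?(ref_name)
  end
end
```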
Rollout Findings
During the rollout to 50%, we observed:
- 5x increase in ListRefs calls to Gitaly
  - A 100% rollout represents approximately a 10x increase compared to the cached version
- After disabling and re-enabling, the pattern confirmed the load increase is directly tied to this feature
See epic discussion for metrics and graphs.
Key Concerns
- Gitaly Load Impact: The 10x increase in ListRefs calls is significantly higher than initially expected
- Self-Managed Instances: While GitLab.com infrastructure can handle this load, customer environments may have:
  - Less memory for filesystem caching
  - Higher disk read latency and lower IOPS capacity
  - Different workload patterns that amplify the impact
- Sustainability: Even though infrastructure can handle it now, this doesn't mean it's the optimal long-term solution
Options to Consider
Option A: Keep current approach (cache removal)
- ✅ Feature flag is already rolled out to 100%
- ✅ Solves cache consistency issues (#539287, #572341 (closed))
- ✅ Simpler codebase without cache complexity
- ❌ 10x increase in Gitaly load
- ❌ Potential performance issues for self-managed instances

Next step: Clean up feature flag and remove caching code
Option B: Revert and explore incremental cache approach
- ✅ Avoids the 10x Gitaly load increase
- ✅ Reduces pressure on Redis
- ✅ Better for self-managed instances with limited resources
- ❌ Might introduce hard-to-debug cache consistency issues
- ❌ More complex implementation

Next step: Disable feature flag, investigate #567993 (a rough sketch of what an incremental cache could look like follows below)
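Since the details of #567993 are still open, this is only a hypothetical sketch of what "incremental" could mean: apply the ref changes from each push to a Redis set instead of expiring and rebuilding the whole list. All class, key, and method names are made up for illustration.

```ruby
require 'redis'

# Hypothetical incremental ref cache: updated in place from push events
# rather than rebuilt wholesale, so Gitaly is only consulted for the deltas.
class IncrementalRefCache
  def initialize(redis, repository_key)
    @redis = redis
    @key = "ref-names:#{repository_key}"
  end

  # Called from the post-receive path with the refs a push created/deleted.
  def apply_changes(created_refs, deleted_refs)
    @redis.multi do |tx|
      tx.sadd(@key, created_refs) unless created_refs.empty?
      tx.srem(@key, deleted_refs) unless deleted_refs.empty?
    end
  end

  def include?(ref_name)
    @redis.sismember(@key, ref_name)
  end
end
```

The hard part, and the source of the ❌ above, is guaranteeing that the deltas are applied exactly once and in order; any missed or duplicated event silently diverges the set from Gitaly's view of the repository.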
Option C: Hybrid approach
- Keep the feature flag but don't enable it by default for self-managed instances
- Allow GitLab.com to use direct Gitaly calls
- Allow self-managed instances to opt in if they have sufficient resources (see the sketch after this list)
- Investigate incremental cache improvements for the default behavior
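As a concrete illustration of the opt-in part (hedged: the defaults described here are the proposal under Option C, not current behavior), a self-managed admin who has measured enough Gitaly headroom could flip the flag from the Rails console using the existing Feature API:

```ruby
# Option C sketch: flag ships default-disabled for self-managed; GitLab.com
# keeps it enabled through the normal rollout process. A self-managed admin
# who wants the direct-Gitaly path opts in from the Rails console:
Feature.enable(:ref_existence_check_gitaly)

# ...and can back out again if ListRefs load becomes a problem:
Feature.disable(:ref_existence_check_gitaly)
```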
Questions to Answer
- Do we have data on self-managed instance performance with this change?
- What is the acceptable threshold for Gitaly load increase?
- Can we implement the incremental cache approach (#567993) without introducing cache consistency bugs?
- Should we consider a TTL-based approach with shorter expiration windows? (A sketch of what that could look like follows this list.)
- What is the long-term vision for reference caching in GitLab?
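On the TTL question specifically, here is a minimal sketch of what a short-expiry cache could look like, assuming a plain `Rails.cache.fetch` wrapper; the method name, cache key, and 30-second window are placeholders for discussion, not code that exists today:

```ruby
# Hypothetical TTL-based ref existence check: stale answers are bounded by the
# expiry window instead of relying on explicit invalidation.
REF_EXISTENCE_TTL = 30.seconds

def cached_ref_exists?(repository, ref_name)
  cache_key = ['ref-exists', repository.full_path, ref_name]

  Rails.cache.fetch(cache_key, expires_in: REF_EXISTENCE_TTL) do
    # Only cache misses (or expired entries) reach Gitaly.
    repository.raw_repository.ref_exists?(ref_name)
  end
end
```

The trade-off is explicit: a ref created or deleted inside the window can be answered incorrectly for up to REF_EXISTENCE_TTL, in exchange for at most one Gitaly lookup per ref per window.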
Related Issues
- Parent epic: #17190
- Rollout issue: #556727 (closed)
- Alternative approach: #567993
- Related bugs: #539287, #572341 (closed)
Decision
To be filled in after discussion
Next Steps
- Review rollout metrics and impact analysis
- Discuss options with team and stakeholders
- Make decision on long-term approach
- Create implementation issue(s) based on decision
- Update #556727 (closed) with decision and next steps