Decide on long-term approach for ref_existence_check_gitaly feature flag
Context
This is a follow-up to #556727 (closed) to decide the long-term approach for the ref_existence_check_gitaly feature flag.
The feature flag has been fully rolled out to production and is currently enabled. However, the rollout raised significant concerns about whether this approach is sustainable long term.
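For context, here is a minimal sketch of what the flag gates, assuming the usual `Feature.enabled?` gating in the GitLab Rails codebase. The method and cache names (`ref_exists?`, `raw_repository.ref_exists?`, `ref_names_from_cache`) are illustrative placeholders, not the actual implementation:

```ruby
# Illustrative sketch only: helper names below are assumptions, not the real code paths.
def ref_exists?(repository, ref_name)
  if Feature.enabled?(:ref_existence_check_gitaly, repository.project)
    # Flag on: ask Gitaly directly on every check (the ListRefs traffic
    # observed during rollout), with no cache that can go stale.
    repository.raw_repository.ref_exists?(ref_name)
  else
    # Flag off: serve the check from the Redis-backed ref cache and rely on
    # explicit invalidation, which is where the consistency bugs came from.
    repository.ref_names_from_cache.include?(ref_name)
  end
end
```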
Rollout Findings
During the rollout to 50%, we observed:
- 5x increase in ListRefs calls to Gitaly
  - A 100% rollout represents approximately a 10x increase compared to the cached version
- After disabling and re-enabling, the pattern confirmed the load increase is directly tied to this feature
See epic discussion for metrics and graphs.
Key Concerns
- Gitaly Load Impact: The 10x increase in ListRefs calls is significantly higher than initially expected
- Self-Managed Instances: While GitLab.com infrastructure can handle this load, customer environments may have:
  - Less memory for filesystem caching
  - Higher disk read latency and lower IOPS capacity
  - Different workload patterns that amplify the impact
- Sustainability: Even though infrastructure can handle it now, this doesn't mean it's the optimal long-term solution
Options to Consider
Option A: Keep current approach (cache removal)
- ✅ Feature flag is already rolled out to 100%
- ✅ Solves cache consistency issues (#539287, #572341 (closed))
- ✅ Simpler codebase without cache complexity
- ❌ 10x increase in Gitaly load
- ❌ Potential performance issues for self-managed instances

Next step: Clean up feature flag and remove caching code
Option B: Revert and explore incremental cache approach
- ✅ Avoids the 10x Gitaly load increase
- ✅ Reduces pressure on Redis
- ✅ Better for self-managed instances with limited resources
- ❌ Might introduce hard-to-debug cache consistency issues
- ❌ More complex implementation

Next step: Disable feature flag, investigate #567993 (a rough sketch of what an incremental cache could look like follows below)
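Since the details of #567993 are still open, this is only a hypothetical sketch of what "incremental" could mean: apply the ref changes from each push to a Redis set instead of expiring and rebuilding the whole list. All class, key, and method names are made up for illustration.

```ruby
require 'redis'

# Hypothetical incremental ref cache: updated in place from push events
# rather than rebuilt wholesale, so Gitaly is only consulted for the deltas.
class IncrementalRefCache
  def initialize(redis, repository_key)
    @redis = redis
    @key = "ref-names:#{repository_key}"
  end

  # Called from the post-receive path with the refs a push created/deleted.
  def apply_changes(created_refs, deleted_refs)
    @redis.multi do |tx|
      tx.sadd(@key, created_refs) unless created_refs.empty?
      tx.srem(@key, deleted_refs) unless deleted_refs.empty?
    end
  end

  def include?(ref_name)
    @redis.sismember(@key, ref_name)
  end
end
```

The hard part, and the source of the ❌ above, is guaranteeing that the deltas are applied exactly once and in order; any missed or duplicated event silently diverges the set from Gitaly's view of the repository.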
Option C: Hybrid approach
- Keep the feature flag but don't enable it by default for self-managed instances
- Allow GitLab.com to use direct Gitaly calls
- Allow self-managed instances to opt in if they have sufficient resources (see the sketch after this list)
- Investigate incremental cache improvements for the default behavior
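As a concrete illustration of the opt-in part (hedged: the defaults described here are the proposal under Option C, not current behavior), a self-managed admin who has measured enough Gitaly headroom could flip the flag from the Rails console using the existing Feature API:

```ruby
# Option C sketch: flag ships default-disabled for self-managed; GitLab.com
# keeps it enabled through the normal rollout process. A self-managed admin
# who wants the direct-Gitaly path opts in from the Rails console:
Feature.enable(:ref_existence_check_gitaly)

# ...and can back out again if ListRefs load becomes a problem:
Feature.disable(:ref_existence_check_gitaly)
```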
Questions to Answer
- Do we have data on self-managed instance performance with this change?
- What is the acceptable threshold for Gitaly load increase?
- Can we implement the incremental cache approach (#567993) without introducing cache consistency bugs?
- Should we consider a TTL-based approach with shorter expiration windows? (A sketch of what that could look like follows this list.)
- What is the long-term vision for reference caching in GitLab?
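On the TTL question specifically, here is a minimal sketch of what a short-expiry cache could look like, assuming a plain `Rails.cache.fetch` wrapper; the method name, cache key, and 30-second window are placeholders for discussion, not code that exists today:

```ruby
# Hypothetical TTL-based ref existence check: stale answers are bounded by the
# expiry window instead of relying on explicit invalidation.
REF_EXISTENCE_TTL = 30.seconds

def cached_ref_exists?(repository, ref_name)
  cache_key = ['ref-exists', repository.full_path, ref_name]

  Rails.cache.fetch(cache_key, expires_in: REF_EXISTENCE_TTL) do
    # Only cache misses (or expired entries) reach Gitaly.
    repository.raw_repository.ref_exists?(ref_name)
  end
end
```

The trade-off is explicit: a ref created or deleted inside the window can be answered incorrectly for up to REF_EXISTENCE_TTL, in exchange for at most one Gitaly lookup per ref per window.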
Related Issues
- Parent epic: #17190
- Rollout issue: #556727 (closed)
- Alternative approach: #567993
- Related bugs: #539287, #572341 (closed)
Decision
To be filled in after discussion
Next Steps
- Review rollout metrics and impact analysis
- Discuss options with team and stakeholders
- Make decision on long-term approach
- Create implementation issue(s) based on decision
- Update #556727 (closed) with decision and next steps