Self-serve tracing tool against Mimir for a real use-case scenario
This issue proposes using production-scale workloads as test cases for evaluating distributed tracing solutions, as part of the distributed tracing evaluation epic.
While #4411 (closed) focuses on surveying tracing tools to build proof-of-concept test applications, we need to complement that approach by validating candidate solutions against real production workloads at scale.
As noted in the epic, our Mimir platform (~300 reads/sec, ~3k writes/sec) represents one of GitLab's largest distributed systems and would provide realistic load patterns for evaluation.
Synthetic test applications can't fully replicate:
- Actual traffic patterns and request distributions
- Real performance characteristics and bottlenecks
- Production complexity of multi-component distributed systems
- Authentic scale that reveals storage, query, and operational challenges
## Objectives
- Instrument a production-scale system with candidate tracing solution(s)
- Evaluate performance impact under realistic load
- Test trace correlation across distributed components
- Assess operational complexity at production scale
- Validate cost and storage projections with real data volumes
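To make the cost and storage objective concrete before any instrumentation exists, a back-of-envelope projection can be derived from the request rates stated above. The sketch below is illustrative only: the span size, spans-per-request, and sampling rates are assumed placeholder values to be replaced with measured numbers once a candidate solution is running; only the ~300 reads/sec and ~3k writes/sec figures come from this issue.

```python
# Back-of-envelope trace storage projection.
# Only the request rates (~300 reads/sec + ~3k writes/sec) come from the
# issue; span size, spans per request, and sampling rates are assumptions
# to be replaced with measured values during the evaluation.

AVG_SPAN_BYTES = 500     # assumed average encoded span size
SPANS_PER_REQUEST = 20   # assumed spans emitted per traced request
SECONDS_PER_DAY = 86_400

def daily_trace_volume_gb(requests_per_sec: float,
                          sampling_rate: float,
                          spans_per_request: int = SPANS_PER_REQUEST,
                          avg_span_bytes: int = AVG_SPAN_BYTES) -> float:
    """Estimated trace data produced per day, in GB (decimal)."""
    spans_per_day = (requests_per_sec * sampling_rate
                     * spans_per_request * SECONDS_PER_DAY)
    return spans_per_day * avg_span_bytes / 1e9

# Mimir-like load from the issue: ~300 reads/sec + ~3k writes/sec.
total_rps = 300 + 3_000
for rate in (1.0, 0.1, 0.01):
    print(f"sampling {rate:>4.0%}: "
          f"{daily_trace_volume_gb(total_rps, rate):8.1f} GB/day")
```

Even this rough model makes the sampling-rate trade-off visible: at these rates, head-sampling choices change daily storage by two orders of magnitude, which is exactly the kind of projection that should be validated against real data volumes.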
## Success Criteria
- Successfully instrument a production-scale system to collect traces under production load
- Measure actual performance overhead
- Document real-world scalability, cost, and operational findings
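For the "measure actual performance overhead" criterion, a minimal benchmarking harness can establish the methodology before a candidate SDK is chosen. The sketch below uses a hypothetical `span` context manager as a stand-in for whichever tracing SDK is under test, and a placeholder workload in place of a real Mimir request handler; both are assumptions, not part of any real SDK.

```python
import time
from contextlib import contextmanager
from statistics import median

# Stand-in for a candidate tracing SDK's span API; in the real
# evaluation this would be replaced by the SDK's own start/end calls.
@contextmanager
def span(name: str, sink: list):
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        sink.append((name, time.perf_counter_ns() - start))

def measure_median_ns(fn, runs: int = 1000) -> int:
    """Median wall-clock duration of fn over `runs` invocations."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - t0)
    return median(samples)

def workload():
    sum(i * i for i in range(200))  # placeholder for a real handler

recorded_spans = []
def traced_workload():
    with span("workload", recorded_spans):
        workload()

baseline = measure_median_ns(workload)
traced = measure_median_ns(traced_workload)
print(f"baseline: {baseline} ns, traced: {traced} ns, "
      f"delta: {traced - baseline} ns")
```

Comparing medians of the same workload with and without instrumentation gives a per-request overhead figure; under production load the same comparison would be done via canary deployments rather than in-process timing, but the principle is identical.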
Edited by Hercules Merscher