Self-serve tracing tool against Mimir for a real use-case scenario

This issue proposes using production-scale workloads as test cases for evaluating distributed tracing solutions, as part of the distributed tracing evaluation epic.

While #4411 (closed) focuses on surveying tracing tools to build proof-of-concept test applications, we need to complement that approach by validating candidate solutions against real production workloads at scale.

As noted in the epic, our Mimir platform (~300 reads/sec, ~3k writes/sec) represents one of GitLab's largest distributed systems and would provide realistic load patterns for evaluation.

Synthetic test applications can't fully replicate:

  • Actual traffic patterns and request distributions
  • Real performance characteristics and bottlenecks
  • Production complexity of multi-component distributed systems
  • Authentic scale that reveals storage, query, and operational challenges

Objectives

  • Instrument a production-scale system with candidate tracing solution(s)
  • Evaluate performance impact under realistic load
  • Test trace correlation across distributed components
  • Assess operational complexity at production scale
  • Validate cost and storage projections with real data volumes
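Since the epic gives concrete request rates, a back-of-envelope projection can anchor the cost and storage discussion before real data arrives. The sketch below uses the rates quoted in this issue; the spans-per-request, span size, and sampling-rate figures are illustrative assumptions to be replaced with measured values, not facts from the epic:

```python
# Rough trace-storage projection from the request rates in this issue.
# All per-span and sampling figures below are ASSUMPTIONS for illustration.

READS_PER_SEC = 300     # from this issue: ~300 reads/sec against Mimir
WRITES_PER_SEC = 3_000  # from this issue: ~3k writes/sec against Mimir

SPANS_PER_REQUEST = 20  # assumption: fan-out across distributed components
BYTES_PER_SPAN = 500    # assumption: typical encoded span size
SAMPLING_RATE = 0.01    # assumption: 1% head-based sampling

def daily_trace_bytes(requests_per_sec: float,
                      spans_per_request: int = SPANS_PER_REQUEST,
                      bytes_per_span: int = BYTES_PER_SPAN,
                      sampling_rate: float = SAMPLING_RATE) -> float:
    """Estimate bytes of trace data stored per day for one request stream."""
    seconds_per_day = 86_400
    sampled_requests_per_sec = requests_per_sec * sampling_rate
    return (sampled_requests_per_sec * spans_per_request
            * bytes_per_span * seconds_per_day)

total = daily_trace_bytes(READS_PER_SEC) + daily_trace_bytes(WRITES_PER_SEC)
print(f"~{total / 1e9:.1f} GB/day of stored trace data")  # → ~28.5 GB/day
```

Even at 1% sampling, the write path alone dominates the projection, which is exactly the kind of result that only a production-scale workload surfaces early.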

Success Criteria

  • Instrument a production-scale system and collect traces under production load
  • Quantify the latency and resource overhead introduced by instrumentation
  • Document real-world scalability, cost, and operational findings
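One way to make the overhead criterion concrete is an A/B timing of an instrumented versus uninstrumented code path. The sketch below is illustrative Python, not the actual evaluation harness (the real measurement would wrap Mimir's Go handlers with the candidate tracer); `traced` is a hypothetical stand-in decorator that models only the timestamping cost of span creation:

```python
import functools
import statistics
import time

def traced(fn):
    """Hypothetical stand-in for a tracing decorator: records a start/end
    timestamp per call, roughly the minimum work of creating a span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            _ = time.perf_counter_ns() - start  # span duration, discarded here
    return wrapper

def handler(n: int) -> int:
    """Dummy request handler standing in for a real read/write path."""
    return sum(range(n))

traced_handler = traced(handler)

def median_ns(fn, *args, runs: int = 2_000) -> float:
    """Median wall-clock time per call, in nanoseconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        fn(*args)
        samples.append(time.perf_counter_ns() - t0)
    return statistics.median(samples)

base = median_ns(handler, 1_000)
instrumented = median_ns(traced_handler, 1_000)
print(f"median per-call overhead: {instrumented - base:.0f} ns")
```

The same shape of comparison, run against the real read and write paths under production load, would turn "measure overhead" into a concrete before/after number per candidate solution.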
Edited by Hercules Merscher