Self-serve tracing tool against Mimir for a real use-case scenario

This issue proposes using production-scale workloads as test cases for evaluating distributed tracing solutions, as part of the distributed tracing evaluation epic.

While #4411 (closed) focuses on surveying tracing tools to build proof-of-concept test applications, we need to complement that approach by validating candidate solutions against real production workloads at scale.

As noted in the epic, our Mimir platform (~300 reads/sec, ~3k writes/sec) represents one of GitLab's largest distributed systems and would provide realistic load patterns for evaluation.

Synthetic test applications can't fully replicate:

  • Actual traffic patterns and request distributions
  • Real performance characteristics and bottlenecks
  • Production complexity of multi-component distributed systems
  • Authentic scale that reveals storage, query, and operational challenges

Objectives

  • Instrument a production-scale system with candidate tracing solution(s)
  • Evaluate performance impact under realistic load
  • Test trace correlation across distributed components
  • Assess operational complexity at production scale
  • Validate cost and storage projections with real data volumes
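Since the epic gives concrete request rates, a back-of-envelope projection can anchor the cost and storage discussion before real data arrives. The sketch below uses the rates quoted in this issue; the spans-per-request, span size, and sampling-rate figures are illustrative assumptions to be replaced with measured values, not facts from the epic:

```python
# Rough trace-storage projection from the request rates in this issue.
# All per-span and sampling figures below are ASSUMPTIONS for illustration.

READS_PER_SEC = 300     # from this issue: ~300 reads/sec against Mimir
WRITES_PER_SEC = 3_000  # from this issue: ~3k writes/sec against Mimir

SPANS_PER_REQUEST = 20  # assumption: fan-out across distributed components
BYTES_PER_SPAN = 500    # assumption: typical encoded span size
SAMPLING_RATE = 0.01    # assumption: 1% head-based sampling

def daily_trace_bytes(requests_per_sec: float,
                      spans_per_request: int = SPANS_PER_REQUEST,
                      bytes_per_span: int = BYTES_PER_SPAN,
                      sampling_rate: float = SAMPLING_RATE) -> float:
    """Estimate bytes of trace data stored per day for one request stream."""
    seconds_per_day = 86_400
    sampled_requests_per_sec = requests_per_sec * sampling_rate
    return (sampled_requests_per_sec * spans_per_request
            * bytes_per_span * seconds_per_day)

total = daily_trace_bytes(READS_PER_SEC) + daily_trace_bytes(WRITES_PER_SEC)
print(f"~{total / 1e9:.1f} GB/day of stored trace data")  # → ~28.5 GB/day
```

Even at 1% sampling, the write path alone dominates the projection, which is exactly the kind of result that only a production-scale workload surfaces early.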

Success Criteria

  • Instrument a production-scale system and collect traces under production load
  • Quantify the latency and resource overhead introduced by instrumentation
  • Document real-world scalability, cost, and operational findings
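One way to make the overhead criterion concrete is an A/B timing of an instrumented versus uninstrumented code path. The sketch below is illustrative Python, not the actual evaluation harness (the real measurement would wrap Mimir's Go handlers with the candidate tracer); `traced` is a hypothetical stand-in decorator that models only the timestamping cost of span creation:

```python
import functools
import statistics
import time

def traced(fn):
    """Hypothetical stand-in for a tracing decorator: records a start/end
    timestamp per call, roughly the minimum work of creating a span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            _ = time.perf_counter_ns() - start  # span duration, discarded here
    return wrapper

def handler(n: int) -> int:
    """Dummy request handler standing in for a real read/write path."""
    return sum(range(n))

traced_handler = traced(handler)

def median_ns(fn, *args, runs: int = 2_000) -> float:
    """Median wall-clock time per call, in nanoseconds."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        fn(*args)
        samples.append(time.perf_counter_ns() - t0)
    return statistics.median(samples)

base = median_ns(handler, 1_000)
instrumented = median_ns(traced_handler, 1_000)
print(f"median per-call overhead: {instrumented - base:.0f} ns")
```

The same shape of comparison, run against the real read and write paths under production load, would turn "measure overhead" into a concrete before/after number per candidate solution.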
Edited by Hercules Merscher