Survey distributed tracing tools
This issue covers the survey and evaluation of distributed tracing solutions as part of the Distributed Tracing: Next epic. The goal is to identify 2-3 candidate solutions for deeper proof-of-concept testing.
## Evaluation Criteria
Solutions will be assessed against the following requirements:
- **OpenTelemetry API compatibility** - Must support the OpenTelemetry API and protocols
- **Production-ready Ruby client libraries** - A critical requirement, given past failures caused by immature client libraries
- **LabKit integration capability** - Must integrate with, or have a clear path to integrating with, LabKit
- **Leverage existing infrastructure** - Solutions that can reuse Elasticsearch or ClickHouse are preferred (but not required if their limitations or costs are prohibitive)
- **Low operational overhead** - Minimal maintenance burden for the Observability team
- **Proven scalability** - Must handle projected GitLab.com traffic volumes
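As a concrete touchstone for the OpenTelemetry compatibility criterion: OpenTelemetry propagates trace context between services via the W3C `traceparent` header, so any candidate must interoperate with it. A minimal, stdlib-only Ruby sketch (helper names are illustrative, not from any library) of generating and parsing that header:

```ruby
require "securerandom"

# Build a W3C Trace Context "traceparent" header:
# version "00", a 16-byte trace id, an 8-byte span id, and trace flags
# ("01" = sampled), all as lowercase hex, joined with dashes.
def build_traceparent(sampled: true)
  trace_id = SecureRandom.hex(16) # 32 hex chars
  span_id  = SecureRandom.hex(8)  # 16 hex chars
  flags    = sampled ? "01" : "00"
  "00-#{trace_id}-#{span_id}-#{flags}"
end

# Parse a traceparent header into its fields; returns nil if malformed.
def parse_traceparent(header)
  m = header.match(/\A(\h{2})-(\h{32})-(\h{16})-(\h{2})\z/)
  return nil unless m
  {
    version:  m[1],
    trace_id: m[2],
    span_id:  m[3],
    sampled:  m[4].to_i(16) & 1 == 1,
  }
end

header = build_traceparent
puts header
puts parse_traceparent(header)[:trace_id]
```

Real instrumentation would of course go through the OpenTelemetry SDK rather than hand-rolling headers; the sketch only pins down the wire format the candidates must agree on.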
## Evaluation Dimensions
- **Usability**: Query interface, visualization, debugging experience
- **Performance**: Application overhead, query performance
- **Cost**: Storage, infrastructure, licensing at projected scale
- **Maintainability**: Operational complexity, deployment model
- **Maturity**: Community support, client library reliability, production usage
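The "application overhead" part of the Performance dimension can be made measurable with a micro-benchmark comparing an instrumented call path against a bare one. A stdlib-only Ruby sketch, where the `with_span` wrapper is a hypothetical stand-in for a real tracer's span API (roughly the cost floor of a sampled-out span):

```ruby
require "benchmark"

# Stand-in for a tracer's span API: records start/finish timestamps
# around the block, which is approximately what a no-op or
# sampled-out span costs per call.
def with_span(name)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  _duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  result
end

# A small unit of work to wrap.
def work
  (1..100).sum
end

n = 100_000
bare   = Benchmark.realtime { n.times { work } }
traced = Benchmark.realtime { n.times { with_span("work") { work } } }

printf("bare:   %.4fs\ntraced: %.4fs\noverhead/call: %.2f us\n",
       bare, traced, (traced - bare) / n * 1_000_000)
```

Running the same harness against each candidate's actual Ruby client (with sampling on and off) would give comparable per-call overhead numbers for the matrix.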
## Solution Categories to Survey
### Cloud-Based / Managed Solutions
- **Pros**: Lower operational overhead, proven scalability
- **Cons**: Potentially higher costs, less control, data sovereignty considerations
- **Examples to evaluate**: Honeycomb, Datadog APM, New Relic, Grafana Cloud Tempo, AWS X-Ray, Google Cloud Trace
### Self-Managed / Open Source Solutions
- **Pros**: Cost control, potential to reuse existing infrastructure, full control
- **Cons**: Higher operational burden for the Observability team
- **Examples to evaluate**: Jaeger (revisit in its current state), Grafana Tempo, Zipkin, SigNoz, Uptrace
## Deliverables
For each surveyed solution, document:
- **Overview**
  - Brief description and architecture
  - Deployment model (cloud vs self-managed)
  - Current maturity and production usage
- **Technical Assessment**
  - OpenTelemetry support status
  - Ruby client library quality and reliability
  - Go client library quality
  - Storage backend options
  - LabKit integration feasibility
- **Operational Analysis**
  - Deployment complexity
  - Maintenance requirements
  - Team expertise needed
  - Monitoring and alerting requirements
- **Trade-offs**
  - Key advantages
  - Key limitations
  - Deal-breakers (if any)
- **Recommendation**
  - Include in PoC phase: Yes / No
  - Reasoning
## Success Criteria
- Create comparison matrix with all evaluation criteria
- Present findings to team for feedback and alignment
- Identify 2-3 candidates for PoC phase with clear justification
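One possible shape for the comparison matrix, rows drawn from the evaluation criteria in this issue (tool columns are placeholders to be filled in during the survey):

```
| Criterion                        | Tool A | Tool B | Tool C |
| -------------------------------- | ------ | ------ | ------ |
| OpenTelemetry API compatibility  |        |        |        |
| Ruby client library maturity     |        |        |        |
| Go client library maturity       |        |        |        |
| LabKit integration feasibility   |        |        |        |
| Reuses Elasticsearch/ClickHouse  |        |        |        |
| Operational overhead             |        |        |        |
| Scalability at GitLab.com volume |        |        |        |
| Cost at projected scale          |        |        |        |
| Include in PoC? (Yes/No)         |        |        |        |
```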