Survey distributed tracing tools
This issue covers the survey and evaluation of distributed tracing solutions as part of the Distributed Tracing: Next epic. The goal is to identify 2-3 candidate solutions for deeper proof-of-concept testing.
## Evaluation Criteria
Solutions will be assessed against the following requirements:
- **OpenTelemetry API compatibility** - Must support the OpenTelemetry API and protocols
- **Production-ready Ruby client libraries** - A critical requirement, given past failures caused by immature client libraries
- **LabKit integration capability** - Must integrate with, or have a clear path to integrating with, LabKit
- **Leverage existing infrastructure** - Solutions that can reuse Elasticsearch or ClickHouse are preferred (but not required if their limitations or costs are prohibitive)
- **Low operational overhead** - Minimal maintenance burden for the Observability team
- **Proven scalability** - Must handle projected GitLab.com traffic volumes
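As a concrete touchstone for the OpenTelemetry compatibility criterion: OpenTelemetry propagates trace context between services via the W3C `traceparent` header, so any candidate must interoperate with it. A minimal, stdlib-only Ruby sketch (helper names are illustrative, not from any library) of generating and parsing that header:

```ruby
require "securerandom"

# Build a W3C Trace Context "traceparent" header:
# version "00", a 16-byte trace id, an 8-byte span id, and trace flags
# ("01" = sampled), all as lowercase hex, joined with dashes.
def build_traceparent(sampled: true)
  trace_id = SecureRandom.hex(16) # 32 hex chars
  span_id  = SecureRandom.hex(8)  # 16 hex chars
  flags    = sampled ? "01" : "00"
  "00-#{trace_id}-#{span_id}-#{flags}"
end

# Parse a traceparent header into its fields; returns nil if malformed.
def parse_traceparent(header)
  m = header.match(/\A(\h{2})-(\h{32})-(\h{16})-(\h{2})\z/)
  return nil unless m
  {
    version:  m[1],
    trace_id: m[2],
    span_id:  m[3],
    sampled:  m[4].to_i(16) & 1 == 1,
  }
end

header = build_traceparent
puts header
puts parse_traceparent(header)[:trace_id]
```

Real instrumentation would of course go through the OpenTelemetry SDK rather than hand-rolling headers; the sketch only pins down the wire format the candidates must agree on.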
## Evaluation Dimensions
- **Usability**: Query interface, visualization, debugging experience
- **Performance**: Application overhead, query performance
- **Cost**: Storage, infrastructure, licensing at projected scale
- **Maintainability**: Operational complexity, deployment model
- **Maturity**: Community support, client library reliability, production usage
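The "application overhead" part of the Performance dimension can be made measurable with a micro-benchmark comparing an instrumented call path against a bare one. A stdlib-only Ruby sketch, where the `with_span` wrapper is a hypothetical stand-in for a real tracer's span API (roughly the cost floor of a sampled-out span):

```ruby
require "benchmark"

# Stand-in for a tracer's span API: records start/finish timestamps
# around the block, which is approximately what a no-op or
# sampled-out span costs per call.
def with_span(name)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  _duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  result
end

# A small unit of work to wrap.
def work
  (1..100).sum
end

n = 100_000
bare   = Benchmark.realtime { n.times { work } }
traced = Benchmark.realtime { n.times { with_span("work") { work } } }

printf("bare:   %.4fs\ntraced: %.4fs\noverhead/call: %.2f us\n",
       bare, traced, (traced - bare) / n * 1_000_000)
```

Running the same harness against each candidate's actual Ruby client (with sampling on and off) would give comparable per-call overhead numbers for the matrix.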
## Solution Categories to Survey
### Cloud-Based / Managed Solutions
- **Pros**: Lower operational overhead, proven scalability
- **Cons**: Potentially higher costs, less control, data sovereignty considerations
- **Examples to evaluate**: Honeycomb, Datadog APM, New Relic, Grafana Cloud Tempo, AWS X-Ray, Google Cloud Trace
### Self-Managed / Open Source Solutions
- **Pros**: Cost control, potential to reuse existing infrastructure, full control
- **Cons**: Higher operational burden for the Observability team
- **Examples to evaluate**: Jaeger (revisit in its current state), Grafana Tempo, Zipkin, SigNoz, Uptrace
## Deliverables
For each surveyed solution, document:
- **Overview**
  - Brief description and architecture
  - Deployment model (cloud vs self-managed)
  - Current maturity and production usage
- **Technical Assessment**
  - OpenTelemetry support status
  - Ruby client library quality and reliability
  - Go client library quality
  - Storage backend options
  - LabKit integration feasibility
- **Operational Analysis**
  - Deployment complexity
  - Maintenance requirements
  - Team expertise needed
  - Monitoring and alerting requirements
- **Trade-offs**
  - Key advantages
  - Key limitations
  - Deal-breakers (if any)
- **Recommendation**
  - Include in PoC phase: Yes / No
  - Reasoning
## Success Criteria
- Create comparison matrix with all evaluation criteria
- Present findings to team for feedback and alignment
- Identify 2-3 candidates for PoC phase with clear justification
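One possible shape for the comparison matrix, rows drawn from the evaluation criteria in this issue (tool columns are placeholders to be filled in during the survey):

```
| Criterion                        | Tool A | Tool B | Tool C |
| -------------------------------- | ------ | ------ | ------ |
| OpenTelemetry API compatibility  |        |        |        |
| Ruby client library maturity     |        |        |        |
| Go client library maturity       |        |        |        |
| LabKit integration feasibility   |        |        |        |
| Reuses Elasticsearch/ClickHouse  |        |        |        |
| Operational overhead             |        |        |        |
| Scalability at GitLab.com volume |        |        |        |
| Cost at projected scale          |        |        |        |
| Include in PoC? (Yes/No)         |        |        |        |
```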