Survey distributed tracing tools

This issue covers the survey and evaluation of distributed tracing solutions as part of the Distributed Tracing: Next epic. The goal is to identify 2-3 candidate solutions for deeper proof-of-concept testing.

Evaluation Criteria

Solutions will be assessed against the following requirements:

  1. OpenTelemetry API compatibility - Must support OpenTelemetry standards
  2. Production-ready Ruby client libraries - Critical requirement based on past failures
  3. LabKit integration capability - Must integrate with LabKit, or have a clear path to doing so
  4. Leverage existing infrastructure - Solutions that can reuse Elasticsearch or ClickHouse are preferred (but not required if their limitations or costs are prohibitive)
  5. Low operational overhead - Minimal maintenance burden for the Observability team
  6. Proven scalability - Must handle projected GitLab.com traffic volumes
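
Criterion 1 ultimately comes down to the W3C Trace Context `traceparent` header that OpenTelemetry-compatible tools exchange across service boundaries. As a minimal, tool-agnostic illustration (the `TraceParent` module name is an assumption for this sketch, not part of any candidate's API), plain Ruby is enough to show the format a candidate must speak:

```ruby
require 'securerandom'

# Minimal W3C Trace Context ("traceparent") helper, illustrating the
# propagation format OpenTelemetry-compatible tools are expected to speak.
# Header layout: version "00" - 16-byte trace id - 8-byte span id - flags.
module TraceParent
  TRACEPARENT_RE = /\A00-(\h{32})-(\h{16})-(\h{2})\z/

  # Build a traceparent header for a new root span (sampled).
  def self.generate
    trace_id = SecureRandom.hex(16) # 32 hex chars
    span_id  = SecureRandom.hex(8)  # 16 hex chars
    "00-#{trace_id}-#{span_id}-01"
  end

  # Parse an incoming header; returns nil when it is malformed.
  def self.parse(header)
    m = TRACEPARENT_RE.match(header.to_s)
    return nil unless m
    { trace_id: m[1], parent_span_id: m[2], sampled: m[3].to_i(16) & 1 == 1 }
  end

  # Continue a trace: keep the trace id, mint a new span id.
  def self.child_of(header)
    ctx = parse(header)
    return generate unless ctx
    "00-#{ctx[:trace_id]}-#{SecureRandom.hex(8)}-01"
  end
end
```

Any candidate that supports OpenTelemetry context propagation will accept and emit headers of this shape, which is what makes cross-service traces stitch together.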

Evaluation Dimensions

  • Usability: Query interface, visualization, debugging experience
  • Performance: Application overhead, query performance
  • Cost: Storage, infrastructure, licensing at projected scale
  • Maintainability: Operational complexity, deployment model
  • Maturity: Community support, client library reliability, production usage

Solution Categories to Survey

Cloud-Based / Managed Solutions

  • Pros: Lower operational overhead, proven scalability
  • Cons: Potentially higher costs, less control, data sovereignty considerations
  • Examples to evaluate: Honeycomb, Datadog APM, New Relic, Grafana Cloud Tempo, AWS X-Ray, Google Cloud Trace

Self-Managed / Open Source Solutions

  • Pros: Cost control, infrastructure reuse potential, full control
  • Cons: Higher operational burden for Observability team
  • Examples to evaluate: Jaeger (revisit in light of its current state), Grafana Tempo, Zipkin, SigNoz, Uptrace

Deliverables

For each surveyed solution, document:

  1. Overview

    • Brief description and architecture
    • Deployment model (cloud vs self-managed)
    • Current maturity and production usage
  2. Technical Assessment

    • OpenTelemetry support status
    • Ruby client library quality and reliability
    • Go client library quality
    • Storage backend options
    • LabKit integration feasibility
  3. Operational Analysis

    • Deployment complexity
    • Maintenance requirements
    • Team expertise needed
    • Monitoring and alerting requirements
  4. Trade-offs

    • Key advantages
    • Key limitations
    • Deal-breakers (if any)
  5. Recommendation

    • Include in PoC phase: Yes / No
    • Reasoning

Success Criteria

  • Create comparison matrix with all evaluation criteria
  • Present findings to team for feedback and alignment
  • Identify 2-3 candidates for PoC phase with clear justification
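
The comparison matrix can be kept in the spreadsheet below; as a hedged sketch, a few lines of Ruby can also render it as a Markdown table for pasting into issues. The `comparison_matrix` helper, tool names, and scores here are illustrative placeholders, not survey results:

```ruby
# Hypothetical helper that renders the evaluation comparison matrix as a
# Markdown table. Dimension names come from the "Evaluation Dimensions"
# section; rows map a solution name to per-dimension scores.
DIMENSIONS = %w[Usability Performance Cost Maintainability Maturity].freeze

def comparison_matrix(rows)
  header    = "| Solution | #{DIMENSIONS.join(' | ')} |"
  separator = "|#{'---|' * (DIMENSIONS.size + 1)}"
  body = rows.map do |name, scores|
    # Missing scores render as "?" so gaps in the survey stay visible.
    "| #{name} | #{DIMENSIONS.map { |d| scores.fetch(d, '?') }.join(' | ')} |"
  end
  [header, separator, *body].join("\n")
end

# Illustrative usage with a placeholder tool name and scores:
puts comparison_matrix('ExampleTool' => { 'Usability' => 'B', 'Cost' => 'A' })
```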

Distributed Tracing Survey Tools Spreadsheet

Edited by Hercules Merscher