
CI/CD Observability: Tracing with OpenTelemetry

Status update: 2025-04-28 @DarrenEastman

The Verify stage at GitLab is not working on tracing with OpenTelemetry for CI/CD in calendar year 2025. We are adding additional CI/CD observability capabilities in the GitLab UI, which can also be accessed via the API.

Release notes

Problem to solve

CI/CD pipelines are a key part of DevSecOps workflows, yet they often lack regular updates and iteration. Sometimes they consume too many resources; in other scenarios they are inefficient and could use a boost.

Pipelines are customized with the jobs and commands being run, and it is hard to get insight into what is actually going on.

Add Pipeline Efficiency docs

The first iteration was to provide Pipeline Efficiency docs to ensure users have a good basis to start from. That manual process is time-consuming, and so is metrics collection inside pipelines.

Data to collect

In addition to metrics, which are potentially tied to job duration in seconds, and the CI job logs, we also want to see a combination: traces with spans that give more context on long-running jobs. Is it the apt install or the compiler that takes the most time and resources in a pipeline job? Artifact and cache collection might also be interesting to visualize.

Tools for CI/CD Observability?

OpenTracing and Jaeger Tracing are well known, and GitLab has an integration to display traces with Jaeger, but cannot collect traces from CI/CD yet.

Honeycomb developed a small tool called buildevents which can be called in CI/CD scripts to send spans and traces to the Honeycomb server. It follows a generic format and relies heavily on environment variables and their manipulation. This complexity should be hidden from the user experience in a first step.

New framework: OpenTelemetry

Note: OTel is a shorter name for OpenTelemetry.

Over time, OpenTelemetry was developed as a specification, with a collector interface and client SDK libraries.

This presentation from 2020 gives an overview of how tracing works: https://docs.google.com/presentation/d/1MAVFeSsTNVWC9wPGOlg83wh8GFtR9hPbdVHuumtgOWA/edit

The principle of OpenTelemetry is to provide the specification and the collector (framework); the backends are well-known projects such as Jaeger, Grafana Tempo, Prometheus, Elasticsearch, etc.

OpenTelemetry Collector Design as Service

(Image credit: OpenTelemetry docs)

Tracing specification


(Image credit: Jaeger docs)

User demand examples

Customers:

@k33g: One of my customers is strongly interested in this topic (internal link); it meets exactly their requirements - #338943 (comment 812222012)

Proposal

Implement OpenTelemetry clients in GitLab Server and Runner to send traces and spans to the OpenTelemetry collector. The collector defines where to store the traces; for testing purposes this will be Jaeger Tracing.

The implementation requires multiple steps across GitLab's architecture, covering the server (Ruby) and runner (Go) parts, which are explained in the JTBD section below.

Similar to the Datadog integration in !46564 (diffs), the idea is to provide entry points in CI/CD pipelines to send traces to an OpenTelemetry collector.
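
To make the direction more concrete, here is a minimal sketch (an assumption, not a committed design) of how the runner-side client could initialize the OpenTelemetry Go SDK with an OTLP exporter; the server-side (Ruby) part would be analogous. The service name "gitlab-runner" and the span name are placeholders, and the collector endpoint comes from the OTEL_* CI/CD variables described below:

// Minimal sketch, assuming the OpenTelemetry Go SDK with an OTLP/gRPC exporter.
// The collector endpoint and headers are read from the OTEL_EXPORTER_OTLP_*
// environment variables (the CI/CD variables described below).
package main

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func main() {
    ctx := context.Background()

    // Exporter that sends spans to the OpenTelemetry collector via OTLP/gRPC.
    exporter, err := otlptracegrpc.New(ctx)
    if err != nil {
        log.Fatalf("creating OTLP exporter: %v", err)
    }

    // Tracer provider with a batching span processor; "gitlab-runner" is a
    // placeholder service name.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("gitlab-runner"),
        )),
    )
    defer func() { _ = tp.Shutdown(ctx) }()
    otel.SetTracerProvider(tp)

    // One span per CI job; child spans per executor stage would be added later.
    _, span := otel.Tracer("ci-pipeline").Start(ctx, "job:build")
    // ... run the job here ...
    span.End()
}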

Additional resources

Proposed Steps

Preparations: Dev Environment

  • docker-compose setup with Jaeger, Prometheus, OTel
  • k3s with OTel, Jaeger, Prometheus Operators
  • Opstrace (ongoing)

OTel config in CI/CD variables

Define the names following the official specification: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/sdk-environment-variables.md

  • Config settings for OpenTelemetry: Host, port, auth, exporter

Example from the Elastic documentation:

export OTEL_EXPORTER_OTLP_ENDPOINT="elastic-apm-server.example.com:8200"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer an_apm_secret_token"
export OTEL_TRACES_EXPORTER="otlp"

GitLab Runner - OpenTelemetry

GitLab Server - OpenTelemetry

Connect Server with Runner - Trace ID

Pass the trace IDs from the server to the runner; a sketch of this propagation follows the list below.

  • Start Trace, generate ID
  • Job start span
  • Send Trace ID as environment variable (and OTel configuration as CI/CD variables) to runner
  • Runner starts span for executor, uses trace ID
  • Runner emits traces to OTel
  • Runner finishes job, uploads artifacts, etc.
  • Server updates trace end
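
A minimal sketch of the propagation step, shown in Go for brevity (the inject side would actually live in the Ruby server code): the trace context is serialized in the W3C traceparent format and handed to the job through a CI/CD variable. CI_TRACEPARENT is a hypothetical variable name, not an agreed one:

// Sketch of passing trace context from server to runner via an environment
// variable, using the OpenTelemetry Go SDK's W3C Trace Context propagator.
package tracecontext

import (
    "context"
    "os"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/trace"
)

var propagator = propagation.TraceContext{}

// InjectToEnv serializes the current span context (started on the server side)
// into a map that can be exported as CI/CD variables for the job.
func InjectToEnv(ctx context.Context) map[string]string {
    carrier := propagation.MapCarrier{}
    propagator.Inject(ctx, carrier)
    // carrier now contains a "traceparent" entry; CI_TRACEPARENT is a
    // hypothetical variable name for handing it to the runner.
    return map[string]string{"CI_TRACEPARENT": carrier["traceparent"]}
}

// StartJobSpan is what the runner would call: it restores the parent context
// from the environment and starts a child span for the executor.
func StartJobSpan(ctx context.Context, jobName string) (context.Context, trace.Span) {
    carrier := propagation.MapCarrier{"traceparent": os.Getenv("CI_TRACEPARENT")}
    parent := propagator.Extract(ctx, carrier)
    return otel.Tracer("gitlab-runner").Start(parent, jobName)
}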

Context metadata enrichment

User defined tracing

  • Provide a CLI tool or CI keyword to trigger span/trace creation inside CI/CD jobs
  • User specifies data - consider the parameter input, additional context, etc.
  • Similar to release-cli/terraform - needs to be available in the executed job containers
  • TODO: Dedicated issue

The Elastic documentation refers to otel-cli.

https://github.com/krzko/opentelemetry-shell offers a library to send traces, metrics, etc. from Shell scripts to OpenTelemetry.
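
As an illustration of what a GitLab-provided helper could look like, the sketch below wraps an arbitrary script command in a span and records its outcome. The "ci-span" name and its command-line interface are hypothetical, not an agreed design:

// Hypothetical "ci-span" helper (name and interface are placeholders): wraps a
// shell command in a span and records its duration and outcome.
// Example usage inside a CI job script:  ci-span compile -- make -j4
package main

import (
    "context"
    "fmt"
    "os"
    "os/exec"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

func main() {
    if len(os.Args) < 4 || os.Args[2] != "--" {
        fmt.Fprintln(os.Stderr, "usage: ci-span <span-name> -- <command> [args...]")
        os.Exit(2)
    }
    name, args := os.Args[1], os.Args[3:]

    // Tracer provider setup (OTLP exporter, resource attributes) is omitted for
    // brevity; it would mirror the runner sketch above and honor the OTEL_*
    // variables. Without that setup, the tracer below is a no-op.
    ctx := context.Background()
    ctx, span := otel.Tracer("ci-span").Start(ctx, name)

    cmd := exec.CommandContext(ctx, args[0], args[1:]...)
    cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
    err := cmd.Run()

    span.SetAttributes(attribute.String("ci.command", args[0]))
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
    } else {
        span.SetStatus(codes.Ok, "")
    }
    span.End() // end explicitly; a deferred End would be skipped by os.Exit

    if err != nil {
        os.Exit(1)
    }
}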

Security

  • OTel environment variables can be set by
    • Admin on the instance level (self-managed default)
    • Owners/Maintainers
    • Users
  • Can variable values be overridden on a group/project level?
  • Can a user override the values in their .gitlab-ci.yml configuration?
  • SaaS: Multiple OTel endpoints
    • Infrastructure team operating GitLab.com SaaS
    • User-defined endpoints on a group level
    • Performance?

Scaling

There are different architectures where this feature can be enabled.

  • Mid-sized GitLab instance
  • Small embedded hardware
  • Large scale environment
  • GitLab.com SaaS

Research and exploration will need benchmarks based on feature availability.

  • Enabled by default?
  • Performance impacts for Runner and Server
  • Increased resource consumption (CPU, memory, etc.) for Runner and Server

Additional thoughts to consider

  • How to debug problems introduced with and by this feature?
  • How to review and accept future additions, e.g. more enriched metadata?
    • Does it need security reviews, e.g. before this is enabled on GitLab.com SaaS?
  • Development docs
  • Development environment

CI/CD Observability does not stop when deployment is done

Integrate with Kubernetes deployments, including the agent, Kubernetes itself, and the application being deployed.

Idea: Find a way to link the trace IDs together, and add a representative metadata link: the CI_ENVIRONMENT variable value in all traces, so application problems can be verified based on environment filters. This would be done in the Opstrace UI.
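
A small sketch of that enrichment, assuming the existing CI_ENVIRONMENT_NAME predefined variable carries the shared value; the attribute key is a placeholder:

// Sketch only: attach the environment as a shared span attribute so CI traces
// and application traces can be filtered by the same value. CI_ENVIRONMENT_NAME
// is an existing predefined CI/CD variable; the attribute key is a placeholder.
package enrich

import (
    "os"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// TagEnvironment adds the environment name to a span, if the job has one.
func TagEnvironment(span trace.Span) {
    if env := os.Getenv("CI_ENVIRONMENT_NAME"); env != "" {
        span.SetAttributes(attribute.String("gitlab.ci.environment", env))
    }
}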

Agent for Kubernetes

  • Implement OTel tracing
  • Can act as OpenTelemetry collector / forwarder / tunnel to the GitLab server?
  • TODO: Create dedicated issue.

Limitations and Scope

A problem with the runner enabling OpenTelemetry in jobs could be network limitations: the GitLab server may run in the same segment as the OpenTelemetry backends while the runner does not. The runner may then need to cache the traces and send them back to the server, which forwards them to OpenTelemetry.
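
One possible shape for that caching, sketched against the Go SDK's SpanExporter interface (illustration only; whether the runner relays through the server or retries an OTLP endpoint directly is still an open question):

// Sketch of a buffering span exporter for constrained networks: spans are
// queued in memory and flushed to the wrapped exporter once it is reachable.
// The wrapped exporter stands in for whatever transport is chosen.
package buffering

import (
    "context"
    "sync"

    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

type BufferingExporter struct {
    mu      sync.Mutex
    pending []sdktrace.ReadOnlySpan
    next    sdktrace.SpanExporter // e.g. the OTLP exporter, or a server relay
}

func New(next sdktrace.SpanExporter) *BufferingExporter {
    return &BufferingExporter{next: next}
}

// ExportSpans tries to forward buffered and new spans; on failure it keeps
// them in memory for a later attempt instead of dropping them.
func (e *BufferingExporter) ExportSpans(ctx context.Context, spans []sdktrace.ReadOnlySpan) error {
    e.mu.Lock()
    defer e.mu.Unlock()
    e.pending = append(e.pending, spans...)
    if err := e.next.ExportSpans(ctx, e.pending); err != nil {
        return err // keep e.pending for the next flush attempt
    }
    e.pending = nil
    return nil
}

func (e *BufferingExporter) Shutdown(ctx context.Context) error {
    // Final flush attempt before the job finishes.
    _ = e.ExportSpans(ctx, nil)
    return e.next.Shutdown(ctx)
}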

Pipelines and jobs need to send a defined set of traces by default; user customization in CI/CD pipelines is desired, e.g. a keyword in the YAML configuration that adds a span to a trace context and then closes it. That likely needs runner-specific implementation and is out of scope for this first MVC.

OpenTelemetry supports traces as GA. In the future, it will also support metrics and log events. Metrics will be tracked as a separate feature proposal.

Use Cases

CI/CD Observability dashboard

Build a CI/CD Observability dashboard that shows the pipeline and job execution as a tracing dashboard. A span with a start/end time gives insight into the job duration and provides additional metadata for context. This can help identify long-running jobs, or immediately visualize external problems.

Better insights for support and professional services teams

Analyzing long-running jobs in detail helps optimize pipelines for more efficiency and reduce cost.

Spans for script command sections could replace log parsing tools like https://gitlab.com/gitlab-com/support/toolbox/list-slowest-job-sections/

Dogfooding on GitLab.com

Enable tracing for CI/CD and analyze pipelines for selected projects.

Needs performance analysis, and potentially multi-tenant environments as OTel collectors.

Opstrace Tracing

  • Use a multi-tenant Opstrace Tracing collector with OTel.
  • Demo video, 2022-01-17: https://www.youtube.com/watch?v=IjW9d-UpARs
  • Integrate the Opstrace UI for Tracing into GitLab CI/CD dashboards.
  • Evaluate options for self-managed and SaaS to provide this functionality out-of-the-box.

Integrate with Datadog

Adopting OpenTelemetry on the client-side allows specifying an OTel endpoint which can be a vendor Datadog endpoint. https://docs.datadoghq.com/tracing/setup_overview/open_standards/#opentelemetry-collector-datadog-exporter

To be defined: depending on the implementation going forward, this may supersede the Datadog integration with a unified OpenTelemetry interface.

Example Implementations

OpenTelemetry has a registry with examples: https://opentelemetry.io/registry/

CI-specific implementations:

Intended users

User experience goal

When the OpenTelemetry integration is enabled, the job traces are sent automatically. The documentation needs to provide examples of how to set up the collector with Jaeger as the backend and frontend for traces - OpenTelemetry is a complex framework, and the time to success needs to be short.

When Opstrace is available, it should be detected out of the box as an OTel endpoint, showing CI/CD Observability dashboards.

Further details

Twitter thread: https://twitter.com/__steele/status/1429681895533465604

Permissions and Security

Enabling the OpenTelemetry integration should be available to both instance admins and group/project owners and maintainers in the settings.

  • Add expected impact to members with no access (0)
  • Add expected impact to Guest (10) members
  • Add expected impact to Reporter (20) members
  • Add expected impact to Developer (30) members
  • Add expected impact to Maintainer (40) members
  • Add expected impact to Owner (50) members

Documentation

See the Feature Change Documentation Workflow https://docs.gitlab.com/ee/development/documentation/workflow.html#for-a-product-change

Availability & Testing

This section needs to be retained and filled in during the workflow planning breakdown phase of this feature proposal, if not earlier.

What risks does this change pose to our availability? How might it affect the quality of the product? What additional test coverage or changes to tests will be needed? Will it require cross-browser testing?

Please list the test areas (unit, integration, and end-to-end) that need to be added or updated to ensure that this feature will work as intended. Please use the list below as guidance.

  • Unit test changes
  • Integration test changes
  • End-to-end test changes

See the test engineering planning process and reach out to your counterpart Software Engineer in Test for assistance: https://about.gitlab.com/handbook/engineering/quality/test-engineering/#test-planning

What does success look like, and how can we measure that?

  • Users start debugging and optimizing their CI/CD workflows with OpenTelemetry.
    • Add integration-enabled tracking to see how often this is used.
  • Vendor support and documentation (OTel receiver endpoints in SaaS platforms such as Datadog, Dynatrace, etc.)

What is the type of buyer?

Free tier: provide pipeline insights for everyone, with the backend sending traces to the configured endpoint.

Premium+ tier:

  • Visualization / UX with advanced group dashboards
  • Default alerting rules and on-call settings
  • Optimization tips based on existing data

Is this a cross-stage feature?

~"group::pipeline execution", ~"group::runner", and ~"group::monitor" are required.

Implementation Scope

Evaluate https://gitlab.com/gitlab-org/labkit-ruby as the single source of truth (SSoT) for instrumentation.

Limit the implementation to using and describing:

  • OpenTelemetry
  • Jaeger Tracing (Traces)

Future test cases:

This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
