CI/CD Observability: Tracing with OpenTelemetry
This page may contain information related to upcoming products, features, and functionality. The information is provided for informational purposes only; please do not rely on it for purchasing or planning decisions. As with all projects, the items mentioned on this page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Release notes
Problem to solve
CI/CD pipelines are a key part of DevSecOps workflows, yet they often lack regular updates and iteration. Some consume too many resources; others are inefficient and could use a boost.
Pipelines are customized through the jobs and commands they run, which makes it hard to get insight into what is actually going on.
Add Pipeline Efficiency docs
The first iteration was to provide Pipeline Efficiency docs so that users have a solid basis to start from. Optimizing pipelines this way is time-consuming, and so is collecting metrics inside the pipelines themselves.
Data to collect
In addition to metrics, which are potentially tied to job duration in seconds, and the CI job logs, we also want a combination of both: traces with spans that add context to long-running jobs. Is it the `apt install` step or the compiler that consumes the most time and resources in a pipeline job? Visualizing artifact and cache collection might also be interesting.
Tools for CI/CD Observability?
OpenTracing and Jaeger Tracing are well known, and GitLab can already display traces through its Jaeger integration, but it cannot collect traces from CI/CD yet.
Honeycomb developed a small tool called buildevents which can be called in CI/CD scripts to send spans and traces to the Honeycomb server. It follows a generic format and relies heavily on environment variables and their manipulation; this complexity should be hidden from the user experience in a first iteration.
New framework: OpenTelemetry
Note: OTel is a short name for OpenTelemetry.
OpenTelemetry evolved over time into a specification with a collector interface and client SDK libraries.
This presentation from 2020 gives an overview of the framework and how tracing works: https://docs.google.com/presentation/d/1MAVFeSsTNVWC9wPGOlg83wh8GFtR9hPbdVHuumtgOWA/edit
The principle of OpenTelemetry is to provide a specification and collector (framework). The backends are known projects like Jaeger, Grafana Tempo, Prometheus, Elasticsearch, etc.
(Image credit: OpenTelemetry docs)
Tracing specification
- Traces specification: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md#traces
- Spans in OpenTelemetry: https://opentelemetry.lightstep.com/spans/
- https://www.jaegertracing.io/docs/1.25/architecture/
(Image credit: Jaeger docs)
User demand examples
- Hacker News topic "Faster GitLab CI/CD pipelines" - thread: https://news.ycombinator.com/item?id=29520577
- Twitter thread: https://twitter.com/__steele/status/1429681895533465604
- Elastic implemented CI/CD Observability as a feature
Customers:
- @k33g: "one of my customers is strongly interested in this topic ... it meets exactly his requirements" (internal link) - #338943 (comment 812222012)
Proposal
Implement OpenTelemetry clients in the GitLab server and runner to send traces and spans to an OpenTelemetry collector. The collector defines where the traces are stored; for testing purposes this will be Jaeger Tracing.
The implementation needs multiple steps in GitLab's architecture, covering the server (Ruby) and runner (Go) parts, which are explained in the sections below.
Similar to the Datadog integration in !46564 (diffs), the idea is to provide entry points in the CI/CD pipelines to send traces to an OpenTelemetry collector.
Additional resources
- GitLab feature discussion video with @jheimbuck_gl and @dnsmichi:
- Embracing Observability in CI/CD with OpenTelemetry slides by Cyrille le Clerc, Elastic
- Meeting with Cyrille, notes (internal) and recording (internal)
- Next steps discussion
- CI/CD Observability from Elastic docs
- Efficient DevSecOps Pipelines slide deck from Continuous Lifecycle, by @dnsmichi
Proposed Steps
Preparations: Dev Environment
- docker-compose setup with Jaeger, Prometheus, OTel
- k3s with OTel, Jaeger, Prometheus Operators
- Opstrace (ongoing)
OTel config in CI/CD variables
Define the names following the official specification: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/sdk-environment-variables.md
- Config settings for OpenTelemetry: Host, port, auth, exporter
Example from the Elastic documentation:
export OTEL_EXPORTER_OTLP_ENDPOINT="elastic-apm-server.example.com:8200"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer an_apm_secret_token"
export OTEL_TRACES_EXPORTER="otlp"
GitLab Runner - OpenTelemetry
- Use the Go SDK to implement OTel https://github.com/open-telemetry/opentelemetry-go
- Start with a standalone app (see the sketch below)
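A minimal sketch of such a standalone proof of concept with the Go SDK could look like the following. The service and span names are illustrative, and the OTLP gRPC exporter pointing at a local collector is an assumption rather than a decided design:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// The OTLP exporter honors OTEL_EXPORTER_OTLP_ENDPOINT and
	// OTEL_EXPORTER_OTLP_HEADERS, so the CI/CD variables described above
	// can be passed through unchanged.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatalf("failed to create OTLP exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		// Illustrative service name, not a decided one.
		sdktrace.WithResource(resource.NewSchemaless(
			attribute.String("service.name", "gitlab-runner"),
		)),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Emit a single span that could represent one CI job execution.
	_, span := otel.Tracer("gitlab-runner/poc").Start(ctx, "ci-job")
	span.SetAttributes(attribute.String("ci.job.name", "build"))
	time.Sleep(200 * time.Millisecond) // stand-in for running the job
	span.End()
}
```

Using the batching span processor keeps exporting off the hot path of job execution, and the deferred `Shutdown` flushes buffered spans before the process exits.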
GitLab Server - OpenTelemetry
- Use the Ruby SDK to implement OTel https://github.com/open-telemetry/opentelemetry-ruby
- Follow the entry points from the Datadog integration: https://docs.gitlab.com/ee/integration/datadog.html & #270123
Connect Server with Runner - Trace ID
Pass the trace IDs from the server to the runner (a propagation sketch follows the list).
- Start Trace, generate ID
- Job start span
- Send Trace ID as environment variable (and OTel configuration as CI/CD variables) to runner
- Runner starts span for executor, uses trace ID
- Runner emits traces to OTel
- Runner finishes job, uploads artifacts, etc.
- Server updates trace end
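The hand-off itself could use the standard W3C trace context propagator, serialized into a CI/CD variable. The sketch below assumes a `TRACEPARENT` variable name, which is illustrative rather than a decided interface:

```go
package tracecontext

import (
	"context"
	"os"

	"go.opentelemetry.io/otel/propagation"
)

// W3C traceparent/tracestate format.
var propagator = propagation.TraceContext{}

// Inject would run on the GitLab server: it serializes the current span
// context into a map that can be exposed to the job as CI/CD variables.
func Inject(ctx context.Context) map[string]string {
	carrier := propagation.MapCarrier{}
	propagator.Inject(ctx, carrier)
	return carrier // e.g. {"traceparent": "00-<trace-id>-<span-id>-01"}
}

// Extract would run in the runner: it restores the server's trace context
// from the job environment, so executor spans become children of the
// pipeline/job trace started on the server.
func Extract(ctx context.Context) context.Context {
	carrier := propagation.MapCarrier{
		"traceparent": os.Getenv("TRACEPARENT"),
	}
	return propagator.Extract(ctx, carrier)
}
```

With this, every span the runner starts from the extracted context shares the trace ID generated on the server, so the whole pipeline shows up as a single trace.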
Context metadata enrichment
- Define which information is necessary to help debug problems.
- Review discussion with Cyrille in https://gitlab.com/gitlab-com/marketing/corporate_marketing/corporate-marketing/-/issues/5747+
- Enrich the span context with metadata (see the sketch below)
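As a starting point for that discussion, a hedged sketch of span enrichment based on GitLab's predefined CI/CD variables; the attribute keys are illustrative and would need to follow whatever naming convention gets agreed on:

```go
package enrichment

import (
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// EnrichSpan attaches CI/CD context from GitLab's predefined CI/CD variables
// to a span. The attribute keys are illustrative.
func EnrichSpan(span trace.Span) {
	span.SetAttributes(
		attribute.String("ci.pipeline.id", os.Getenv("CI_PIPELINE_ID")),
		attribute.String("ci.job.id", os.Getenv("CI_JOB_ID")),
		attribute.String("ci.job.stage", os.Getenv("CI_JOB_STAGE")),
		attribute.String("ci.project.path", os.Getenv("CI_PROJECT_PATH")),
		attribute.String("ci.commit.sha", os.Getenv("CI_COMMIT_SHA")),
	)
}
```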
User defined tracing
- Provide a CLI tool or CI keyword to trigger span/trace creation inside CI/CD jobs
- User specifies data - consider the parameter input, additional context, etc.
- Similar to release-cli/terraform - needs to be available in the executed job containers
- TODO: Dedicated issue
The Elastic documentation refers to otel-cli.
https://github.com/krzko/opentelemetry-shell offers a library to send traces, metrics, etc. from Shell scripts to OpenTelemetry.
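To make the idea more concrete, here is a rough Go sketch of such a helper in the spirit of otel-cli and buildevents: it wraps an arbitrary command in a span and records failures. The `ci-span` name, argument handling, and attribute are hypothetical, not a proposed interface:

```go
// Hypothetical helper in the spirit of otel-cli and buildevents: wrap an
// arbitrary command in a span, e.g. `ci-span compile make -j4`.
package main

import (
	"context"
	"log"
	"os"
	"os/exec"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	if len(os.Args) < 3 {
		log.Fatal("usage: ci-span <span-name> <command> [args...]")
	}
	spanName, command, args := os.Args[1], os.Args[2], os.Args[3:]

	ctx := context.Background()
	exporter, err := otlptracegrpc.New(ctx) // uses the OTEL_EXPORTER_OTLP_* variables
	if err != nil {
		log.Fatalf("exporter setup failed: %v", err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }() // flush spans before exiting
	otel.SetTracerProvider(tp)

	// Start a span around the wrapped command. Joining the surrounding job
	// trace via a propagated traceparent is omitted here for brevity, as is
	// propagating the command's exit code.
	_, span := otel.Tracer("ci-span").Start(ctx, spanName)
	defer span.End()

	cmd := exec.Command(command, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "command failed")
		span.SetAttributes(attribute.Bool("ci.command.failed", true))
	}
}
```

A job could then call it as `ci-span compile make -j4`; joining the surrounding job trace would additionally require extracting the propagated trace context as sketched earlier.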
Security
- OTel environment variables can be set by
- Admin on the instance level (self-managed default)
- Owners/Maintainers
- Users
- Can variable values be overridden on a group/project level?
- Can a user override the values in their .gitlab-ci.yml configuration?
- SaaS: Multiple OTel endpoints
- Infrastructure team operating GitLab.com SaaS
- User-defined endpoints on a group level
- Performance?
Scaling
There are different architectures where this feature can be enabled.
- Mid-sized GitLab instance
- Small embedded hardware
- Large scale environment
- GitLab.com SaaS
Research and exploration will need benchmarks based on feature availability.
- Enabled by default?
- Performance impacts for Runner and Server
- Increased resource consumption (CPU, memory, etc.) for Runner and Server
Additional thoughts to consider
- How to debug problems introduced with and by this feature?
- How to review and accept future additions, e.g. more enriched metadata
- Does it need security reviews, e.g. before this is enabled on GitLab.com SaaS
- Development docs
- Development environment
CI/CD Observability does not stop when the deployment is done
Integrate with Kubernetes deployments, include the agent, Kubernetes itself, and the application being deployed.
Idea: find a way to link the trace IDs together and add a shared piece of metadata, for example the CI_ENVIRONMENT variable value, to all traces, so that application problems can be verified with environment filters. This filtering could happen in the Opstrace UI. A sketch of how the environment could be stamped on every trace follows.
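One hedged way to make that filter work is to stamp the environment at the resource level, so every span from a process carries it; the attribute key below is illustrative:

```go
package linking

import (
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// NewTracerProvider stamps the deployment environment onto every span emitted
// by this process, so CI traces, agent traces, and application traces can be
// correlated with a single environment filter in the tracing UI.
func NewTracerProvider(exp sdktrace.SpanExporter) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewSchemaless(
			// CI_ENVIRONMENT_NAME is set by GitLab for jobs with an environment;
			// the attribute key itself is illustrative.
			attribute.String("ci.environment.name", os.Getenv("CI_ENVIRONMENT_NAME")),
		)),
	)
}
```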
Agent for Kubernetes
- Implement OTel tracing
- Can it act as an OpenTelemetry collector / forwarder / tunnel to the GitLab server?
- TODO: Create dedicated issue.
Limitations and Scope
A possible problem with the runner enabling OpenTelemetry in jobs is network limitations: the GitLab server typically runs in the same network segment as the OpenTelemetry backends, but the runner might not. The runner may need to cache the traces and send them to the server, which then forwards them to OpenTelemetry.
Pipelines and jobs need to send a defined set of traces by default; user customization in CI/CD pipelines is desired, e.g. a keyword in the YAML configuration that adds a span to a trace context and then closes it. That likely needs GitLab Runner-specific implementation and is out of scope for this first MVC.
OpenTelemetry supports traces as GA. In the future it will also support metrics and log events; metrics will be tracked as a separate feature proposal.
Use Cases
CI/CD Observability dashboard
Build a CI/CD Observability dashboard that shows pipeline and job execution as a tracing dashboard. A span with a start/end time gives insight into the job duration and provides additional metadata for context. This can help identify long-running jobs, or immediately visualize external problems.
Better insights for support and professional services teams
Analyzing long-running jobs in detail helps optimize pipelines for more efficiency and reduced cost.
Tracing `script` command sections can help replace log parsing tools like https://gitlab.com/gitlab-com/support/toolbox/list-slowest-job-sections/
Dogfooding on GitLab.com
Enable tracing for CI/CD and analyse pipelines for selected projects.
Needs performance analysis, and potentially multi-tenant environments as OTel collectors.
Opstrace Tracing
- Use a multi-tenant Opstrace Tracing collector with OTel.
- Demo video, 2022-01-17: https://www.youtube.com/watch?v=IjW9d-UpARs
- Integrate the Opstrace UI for Tracing into GitLab CI/CD dashboards.
- Evaluate options for self-managed and SaaS to provide this functionality out-of-the-box.
Integrate with Datadog
Adopting OpenTelemetry on the client side allows specifying an OTel endpoint, which can also be a vendor endpoint such as Datadog. https://docs.datadoghq.com/tracing/setup_overview/open_standards/#opentelemetry-collector-datadog-exporter
To be defined: depending on the implementation going forward, a unified OpenTelemetry interface may be able to supersede the existing Datadog integration.
Example Implementations
OpenTelemetry has a registry with examples: https://opentelemetry.io/registry/
- Kubernetes: https://kubernetes.io/docs/concepts/cluster-administration/system-traces/
- AWS: https://aws-otel.github.io/docs/setup/eks
CI specific implementations:
- Jenkins (Java): https://plugins.jenkins.io/opentelemetry/
- Teamcity (Java): https://github.com/OctopusDeploy/opentelemetry-teamcity-plugin
Intended users
- Sasha (Software Developer)
- Devon (DevOps Engineer)
- Sidney (Systems Administrator)
- Simone (Software Engineer in Test)
- Allison (Application Ops)
- Priyanka (Platform Engineer)
User experience goal
When the OpenTelemetry integration is enabled, the job traces are sent automatically. The documentation needs to provide examples of how to set up the collector with Jaeger as the backend and frontend for traces - OpenTelemetry is a complex framework, and the time to success needs to be short.
When Opstrace is available, it should be detected out-of-the-box as the OTel endpoint, showing CI/CD Observability dashboards.
Further details
- https://www.cmg.org/wp-content/uploads/2021/02/eBook_GuideToOpenTelemetry.pdf
- https://docs.datadoghq.com/tracing/setup_overview/open_standards/
- https://kubernetes.io/blog/2021/09/03/api-server-tracing/
- https://jenkins-x.io/blog/2021/04/08/jx3-pipeline-trace/
- https://concourse-ci.org/tracing.html
- https://www.honeycomb.io/blog/working-on-hitting-a-release-cadence-ci-cd-observability-can-help-you-get-there/
Permissions and Security
Enabling the OpenTelemetry integration should be available to both instance admins and group/project owners and maintainers in the settings.
- Add expected impact to members with no access (0)
- Add expected impact to Guest (10) members
- Add expected impact to Reporter (20) members
- Add expected impact to Developer (30) members
- Add expected impact to Maintainer (40) members
- Add expected impact to Owner (50) members
Documentation
See the Feature Change Documentation Workflow https://docs.gitlab.com/ee/development/documentation/workflow.html#for-a-product-change
- Add all known Documentation Requirements in this section. See https://docs.gitlab.com/ee/development/documentation/workflow.html
- If this feature requires changing permissions, update the permissions document. See https://docs.gitlab.com/ee/user/permissions.html
Availability & Testing
This section needs to be retained and filled in during the workflow planning breakdown phase of this feature proposal, if not earlier.
What risks does this change pose to our availability? How might it affect the quality of the product? What additional test coverage or changes to tests will be needed? Will it require cross-browser testing?
Please list the test areas (unit, integration, and end-to-end) that need to be added or updated to ensure that this feature will work as intended. Please use the list below as guidance.
- Unit test changes
- Integration test changes
- End-to-end test changes
See the test engineering planning process and reach out to your counterpart Software Engineer in Test for assistance: https://about.gitlab.com/handbook/engineering/quality/test-engineering/#test-planning
What does success look like, and how can we measure that?
- Users start debugging and optimizing their CI/CD workflows with OpenTelemetry.
- Add integration-enabled tracking to see how often this is used.
- Vendor support and documentation (OTel receiver endpoints in SaaS platforms such as Datadog, Dynatrace, etc.)
What is the type of buyer?
Free tier: provide pipeline insights for everyone, with the backend sending traces to a configured endpoint.
Premium+ tier:
- Visualization / UX with advanced group dashboards
- Default alerting rules and on-call settings
- Optimization tips based on existing data
Is this a cross-stage feature?
~"group::pipeline execution", ~"group::runner", and ~"group::monitor" are required.
Implementation Scope
Evaluate https://gitlab.com/gitlab-org/labkit-ruby as the single source of truth (SSoT) for instrumentation.
Limit the implementation to using and describing:
- OpenTelemetry
- Jaeger Tracing (Traces)
Future test cases:
- Opstrace integration, depending on integration milestones
FYI @jreporter @DarrenEastman @kencjohnston @andrewn @gitlab-de