CI Job Telemetry - Runner Instrumentation
## Overview

Instrument GitLab Runner to collect and stream telemetry spans for CI job execution using OpenTelemetry Protocol (OTLP), pushing directly to the OTEL Collector via [LabKit](https://gitlab.com/gitlab-org/labkit).

## Parent Epic

&20632

## Phased Delivery

### Phase 1: End-to-end integration (Feature negotiation + first span)

The goal is to validate the full pipeline (Runner → OTEL Collector → ClickHouse → Grafana) with a single `job_execution` span.

| Issue | Description |
|-------|-------------|
| gitlab-org/gitlab-runner#39231+ | Feature negotiation (`job_telemetry` feature flag + `trace_context` from job payload), OTLP export client (via LabKit), and first `job_execution` span covering the full job lifecycle |

**Dependencies**: Rails-side feature negotiation (gitlab-org/gitlab#590588+) and trace context initialization (gitlab-org/gitlab#590587+) must be implemented first.

### Phase 2: Built-in build stage spans

Instrument each built-in build stage as a child span under `job_execution`.

| Issue | Description |
|-------|-------------|
| gitlab-org/gitlab-runner#39230+ | Spans for `prepare_executor`, `pull_image`, `get_sources`, `restore_cache`, `step_script`, `after_script`, `archive_cache`, `upload_artifacts` with stage-specific metadata |

**Estimate**: Phases 1 + 2 combined: ~1 week (per @ash2k)

### Phase 3: CI Functions spans

Instrument CI Function invocations as child spans under `job_execution`.

| Issue | Description |
|-------|-------------|
| gitlab-org/gitlab-runner#39271+ | Spans for each CI Function invocation with function name, version, and status |

**Estimate**: ~2 weeks (conservative)

**Total Runner estimate: ~3 weeks** (Phases 1+2: ~1 week, Phase 3: ~2 weeks)

## Key Implementation Details

- **LabKit integration**: Use [LabKit](https://gitlab.com/gitlab-org/labkit) for OTEL SDK integration, which aligns with the rest of GitLab's instrumentation
- **OTEL Collector endpoint**: Static runner manager configuration (not passed per-job)
- **OIDC/workload identity auth**: GitLab.com hosted runners authenticate directly using OIDC tokens
- **Graceful degradation**: Telemetry failures must never fail jobs
- **Streaming**: Spans are pushed as stages/functions complete (not batched at job end)

## Architecture Reference

<https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#gitlab-runner-changes>
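
To illustrate the **Graceful degradation** and **Streaming** points under Key Implementation Details, here is a minimal Go sketch using only the standard library. It is not the Runner's or LabKit's implementation: `Span`, `Exporter`, `Streamer`, `Stage`, and `RunJob` are hypothetical names invented for this example, and a real integration would presumably go through LabKit and the OTEL SDK instead. The sketch emits one child span as each stage finishes, followed by the parent `job_execution` span, and ensures that export failures only increment a drop counter and never surface as a job error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Span is a simplified stand-in for an OTLP span (hypothetical, not the OTEL SDK type).
type Span struct {
	Name, Parent string
	Start, End   time.Time
}

// Exporter is a hypothetical interface for the client that pushes spans to the collector.
type Exporter interface {
	Export(ctx context.Context, s Span) error
}

// Streamer forwards finished spans to an Exporter from a background goroutine.
type Streamer struct {
	queue   chan Span
	wg      sync.WaitGroup
	mu      sync.Mutex
	dropped int
}

func NewStreamer(exp Exporter, buffer int) *Streamer {
	s := &Streamer{queue: make(chan Span, buffer)}
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		for span := range s.queue {
			ctx, cancel := context.WithTimeout(context.Background(), time.Second)
			if err := exp.Export(ctx, span); err != nil {
				s.countDrop() // export failure is recorded, never propagated
			}
			cancel()
		}
	}()
	return s
}

func (s *Streamer) countDrop() { s.mu.Lock(); s.dropped++; s.mu.Unlock() }

// Emit never blocks the job: if the queue is full, the span is dropped.
func (s *Streamer) Emit(span Span) {
	select {
	case s.queue <- span:
	default:
		s.countDrop()
	}
}

// Close flushes queued spans and returns how many were dropped.
func (s *Streamer) Close() int {
	close(s.queue)
	s.wg.Wait()
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.dropped
}

// Stage is one build stage; Run holds the actual work.
type Stage struct {
	Name string
	Run  func() error
}

// RunJob runs stages in order and emits a child span as each one finishes.
// The returned error reflects stage failures only, never telemetry failures.
func RunJob(stages []Stage, s *Streamer) error {
	jobStart := time.Now()
	var jobErr error
	for _, st := range stages {
		start := time.Now()
		err := st.Run()
		s.Emit(Span{Name: st.Name, Parent: "job_execution", Start: start, End: time.Now()})
		if err != nil {
			jobErr = fmt.Errorf("stage %s: %w", st.Name, err)
			break
		}
	}
	s.Emit(Span{Name: "job_execution", Start: jobStart, End: time.Now()})
	return jobErr
}

// failingExporter simulates an unreachable collector.
type failingExporter struct{}

func (failingExporter) Export(context.Context, Span) error {
	return errors.New("collector unreachable")
}

// recordingExporter keeps the names of exported spans, in export order.
type recordingExporter struct {
	mu    sync.Mutex
	names []string
}

func (r *recordingExporter) Export(_ context.Context, s Span) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.names = append(r.names, s.Name)
	return nil
}

func main() {
	ok := func() error { return nil }
	s := NewStreamer(failingExporter{}, 16)
	err := RunJob([]Stage{{Name: "get_sources", Run: ok}, {Name: "step_script", Run: ok}}, s)
	fmt.Println("job error:", err, "dropped spans:", s.Close())
	// prints: job error: <nil> dropped spans: 3
}
```

The non-blocking `Emit` is the design choice that keeps a slow or unavailable collector from stalling a job; the cost is that spans can be lost under backpressure, which the drop counter makes visible.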