CI Job Telemetry - Runner Instrumentation
## Overview
Instrument GitLab Runner to collect telemetry spans for CI job execution and stream them over the OpenTelemetry Protocol (OTLP) directly to the OTEL Collector, using [LabKit](https://gitlab.com/gitlab-org/labkit).
## Parent Epic
&20632
## Phased Delivery
### Phase 1: End-to-end integration (Feature negotiation + first span)
The goal is to validate the full pipeline (Runner → OTEL Collector → ClickHouse → Grafana) with a single `job_execution` span.
| Issue | Description |
|-------|-------------|
| gitlab-org/gitlab-runner#39231+ | Feature negotiation (`job_telemetry` feature flag + `trace_context` from job payload), OTLP export client (via LabKit), and first `job_execution` span covering the full job lifecycle |
**Dependencies**: Rails-side feature negotiation (gitlab-org/gitlab#590588+) and trace context initialization (gitlab-org/gitlab#590587+) must be implemented first.
### Phase 2: Built-in build stage spans
Instrument each built-in build stage as a child span under `job_execution`.
| Issue | Description |
|-------|-------------|
| gitlab-org/gitlab-runner#39230+ | Spans for `prepare_executor`, `pull_image`, `get_sources`, `restore_cache`, `step_script`, `after_script`, `archive_cache`, `upload_artifacts` with stage-specific metadata |
**Estimate**: Phases 1 + 2 combined: ~1 week (per @ash2k)
### Phase 3: CI Functions spans
Instrument CI Function invocations as child spans under `job_execution`.
| Issue | Description |
|-------|-------------|
| gitlab-org/gitlab-runner#39271+ | Spans for each CI Function invocation with function name, version, and status |
**Estimate**: ~2 weeks (conservative)
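The per-invocation data listed above (function name, version, status) could be mapped onto a span roughly as follows, with the span parented to `job_execution` in the same way as the built-in stage spans. The span name format, the `ci.function.*` attribute keys, and `FunctionError` are illustrative placeholders, not an agreed schema. `StatusCode` mirrors OTEL's three span status codes (Unset, Error, Ok).

```go
package main

import "strconv"

// FunctionError is a hypothetical error for a CI Function that exited
// with a non-zero code.
type FunctionError struct{ ExitCode int }

func (e *FunctionError) Error() string {
	return "function exited with code " + strconv.Itoa(e.ExitCode)
}

// StatusCode mirrors the three OTEL span status codes.
type StatusCode int

const (
	StatusUnset StatusCode = iota
	StatusError
	StatusOK
)

// FunctionInvocation describes one CI Function call inside a job.
type FunctionInvocation struct {
	Name    string
	Version string
	Err     error
}

// FunctionSpan returns the span name, attributes, and status for one
// invocation. Names and keys here are placeholders.
func FunctionSpan(inv FunctionInvocation) (string, map[string]string, StatusCode) {
	name := "ci_function " + inv.Name
	attrs := map[string]string{
		"ci.function.name":    inv.Name,
		"ci.function.version": inv.Version,
		"ci.function.status":  "success",
	}
	if inv.Err != nil {
		attrs["ci.function.status"] = "failed"
		attrs["ci.function.error"] = inv.Err.Error()
		return name, attrs, StatusError
	}
	return name, attrs, StatusOK
}
```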
**Total Runner estimate: ~3 weeks** (Phases 1+2: ~1 week, Phase 3: ~2 weeks)
## Key Implementation Details
- **LabKit integration**: Use [LabKit](https://gitlab.com/gitlab-org/labkit) for OTEL SDK integration, consistent with the rest of GitLab's instrumentation
- **OTEL Collector endpoint**: Static runner manager configuration (not passed per-job)
- **OIDC/workload identity auth**: GitLab.com hosted runners authenticate to the OTEL Collector directly using OIDC tokens
- **Graceful degradation**: Telemetry failures must never fail jobs
- **Streaming**: Spans are pushed as stages/functions complete (not batched at job end)
## Architecture Reference
<https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#gitlab-runner-changes>