CI Job Telemetry: implement failure handling and graceful degradation
## Summary

Implement failure handling for CI Job Telemetry in the Runner so that telemetry failures are completely transparent to job execution. Telemetry is strictly best-effort and must never cause a job to fail, slow down, or change behavior.

## Requirements

| Scenario | Required behavior |
|----------|-------------------|
| **Collector unreachable** | OTLP exporter retries transient errors with [exponential backoff and jitter](https://opentelemetry.io/docs/specs/otel/protocol/exporter/#retry). Export errors are reported to the global [`otel.ErrorHandler`](https://opentelemetry.io/docs/specs/otel/error-handling/#configuring-error-handlers), which increments Prometheus counters. Job execution proceeds normally. |
| **Sustained outage** | Spans buffer in the [`BatchSpanProcessor`](https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor) queue (bounded by `MaxQueueSize`, default 2048). When the queue is full, **new** spans are dropped on enqueue (not oldest). Drops are reported through `otel.ErrorHandler`. |
| **Post-job flush timeout** | Runner calls [`TracerProvider.Shutdown(ctx)`](https://opentelemetry.io/docs/specs/otel/trace/sdk/#shutdown) with a deadline (default: 30 seconds). If flush does not complete in time, undelivered spans are dropped and a warning is logged. |
| **Malformed/rejected spans** | Collector returns `400 Bad Request` (non-transient per [OTLP spec](https://opentelemetry.io/docs/specs/otlp/#failures-1)). OTLP exporter does not retry; error is reported to [`otel.ErrorHandler`](https://opentelemetry.io/docs/specs/otel/error-handling/#configuring-error-handlers). |
| **Auth failure (401/403)** | Runner logs the error and disables telemetry export for the remainder of the job. Job proceeds normally. |

## Design principles

- **No job impact**: Telemetry failures are invisible to the job. Exit code, artifacts, and logs are unaffected.
- **No disk I/O**: Spans are buffered in memory only. No on-disk persistence — if the Runner process crashes, buffered spans are lost. This is acceptable because telemetry gaps result in missing dashboard data, not broken pipelines.
- **Observability of telemetry itself**: The Runner exposes Prometheus metrics for telemetry health:
  - `ci_telemetry_spans_exported_total`
  - `ci_telemetry_spans_dropped_total`
  - `ci_telemetry_export_errors_total`

## Architecture Reference

- [Failure handling](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#failure-handling)

## Parent Epic

gitlab-org&20633
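The requirements above map fairly directly onto OpenTelemetry Go SDK configuration. The following is a minimal sketch, not the Runner implementation: the `Setup` function, the `endpoint` parameter, and the metric wiring are assumptions for illustration, but `WithRetry`, `WithMaxQueueSize`, `SetErrorHandler`, and `Shutdown` are real SDK APIs covering the retry, bounded-queue, error-reporting, and flush-deadline behaviors.

```go
package telemetry

import (
	"context"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// exportErrors backs ci_telemetry_export_errors_total (hypothetical wiring;
// registration with the Runner's registry is omitted).
var exportErrors = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "ci_telemetry_export_errors_total",
	Help: "Total number of telemetry export errors.",
})

// Setup configures best-effort tracing and returns a shutdown func that
// flushes with a bounded deadline, dropping anything left over.
func Setup(ctx context.Context, endpoint string) (func(), error) {
	// Export failures and queue drops surface here instead of
	// affecting job execution.
	otel.SetErrorHandler(otel.ErrorHandlerFunc(func(err error) {
		exportErrors.Inc()
		log.Printf("telemetry: export error (job unaffected): %v", err)
	}))

	// OTLP exporter retries transient errors with exponential backoff
	// and jitter; non-transient responses (e.g. 400) are not retried.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		otlptracegrpc.WithRetry(otlptracegrpc.RetryConfig{
			Enabled:         true,
			InitialInterval: 5 * time.Second,
			MaxInterval:     30 * time.Second,
			MaxElapsedTime:  time.Minute,
		}),
	)
	if err != nil {
		return nil, err
	}

	// Bounded in-memory queue; when full, new spans are dropped on enqueue.
	bsp := sdktrace.NewBatchSpanProcessor(exp,
		sdktrace.WithMaxQueueSize(2048),
	)
	tp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(bsp))
	otel.SetTracerProvider(tp)

	shutdown := func() {
		// Post-job flush with a hard 30-second deadline; undelivered
		// spans are dropped if the collector cannot keep up.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := tp.Shutdown(ctx); err != nil {
			log.Printf("telemetry: flush timed out, dropping spans: %v", err)
		}
	}
	return shutdown, nil
}
```

The auth-failure row (401/403 disabling export for the rest of the job) is not shown here; it would require inspecting errors inside the handler, which the SDK leaves to the integrator.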
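The "drop **new** spans when the queue is full" policy (as opposed to evicting the oldest) can be illustrated in isolation with a plain bounded buffer. `spanQueue` and its methods are hypothetical, shown only to make the enqueue semantics concrete; the real behavior lives inside the SDK's `BatchSpanProcessor`.

```go
package main

import "fmt"

// spanQueue is a hypothetical bounded buffer mirroring the
// BatchSpanProcessor policy: when full, the NEW item is dropped
// on enqueue and a drop counter is incremented.
type spanQueue struct {
	buf     chan string
	dropped int
}

func newSpanQueue(max int) *spanQueue {
	return &spanQueue{buf: make(chan string, max)}
}

// enqueue never blocks the caller (the job); it drops instead.
func (q *spanQueue) enqueue(span string) bool {
	select {
	case q.buf <- span:
		return true
	default:
		q.dropped++ // would be reported via ci_telemetry_spans_dropped_total
		return false
	}
}

func main() {
	q := newSpanQueue(2)
	fmt.Println(q.enqueue("span-1")) // true
	fmt.Println(q.enqueue("span-2")) // true
	fmt.Println(q.enqueue("span-3")) // false: queue full, new span dropped
	fmt.Println(q.dropped)           // 1
}
```

The non-blocking `select` with a `default` branch is what keeps telemetry off the job's critical path: enqueue either succeeds immediately or drops, but never waits.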