CI Job Telemetry: implement failure handling and graceful degradation
## Summary
Implement failure handling for CI Job Telemetry in the Runner so that telemetry failures are completely transparent to job execution. Telemetry is strictly best-effort and must never cause a job to fail, slow down, or change behavior.
## Requirements
| Scenario | Required behavior |
|----------|------------------|
| **Collector unreachable** | OTLP exporter retries transient errors with [exponential backoff and jitter](https://opentelemetry.io/docs/specs/otel/protocol/exporter/#retry). Export errors are reported to the global [`otel.ErrorHandler`](https://opentelemetry.io/docs/specs/otel/error-handling/#configuring-error-handlers), which increments Prometheus counters. Job execution proceeds normally. |
| **Sustained outage** | Spans buffer in the [`BatchSpanProcessor`](https://opentelemetry.io/docs/specs/otel/trace/sdk/#batching-processor) queue (bounded by `MaxQueueSize`, default 2048). When the queue is full, **new** spans are dropped on enqueue (not oldest). Drops are reported through `otel.ErrorHandler`. |
| **Post-job flush timeout** | Runner calls [`TracerProvider.Shutdown(ctx)`](https://opentelemetry.io/docs/specs/otel/trace/sdk/#shutdown) with a deadline (default: 30 seconds). If flush does not complete in time, undelivered spans are dropped and a warning is logged. |
| **Malformed/rejected spans** | Collector returns `400 Bad Request` (non-transient per [OTLP spec](https://opentelemetry.io/docs/specs/otlp/#failures-1)). OTLP exporter does not retry; error is reported to [`otel.ErrorHandler`](https://opentelemetry.io/docs/specs/otel/error-handling/#configuring-error-handlers). |
| **Auth failure (401/403)** | Runner logs the error and disables telemetry export for the remainder of the job. Job proceeds normally. |
## Design principles
- **No job impact**: Telemetry failures are invisible to the job. Exit code, artifacts, and logs are unaffected.
- **No disk I/O**: Spans are buffered in memory only. No on-disk persistence — if the Runner process crashes, buffered spans are lost. This is acceptable because telemetry gaps result in missing dashboard data, not broken pipelines.
- **Observability of telemetry itself**: The Runner exposes Prometheus metrics for telemetry health:
- `ci_telemetry_spans_exported_total`
- `ci_telemetry_spans_dropped_total`
- `ci_telemetry_export_errors_total`
## Architecture Reference
- [Failure handling](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#failure-handling)
## Parent Epic
gitlab-org&20633