CI Job Telemetry - CI Platform (MV, Query Layer, UI)
## Overview
This epic tracks CI Platform's work for CI Job Telemetry Reporting — feature negotiation, trace context initialization, the ClickHouse Materialized View, and the Rails query layer.
## Parent Epic
&20632
## Architecture Reference
<https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/>
## MVC Scope
### 1. Feature Negotiation (Rails side) — ~1-2 days
gitlab-org/gitlab#590588+
- Report `job_telemetry` feature availability (enabled yes/no) in the job payload response
- Controlled by a **project-level feature flag** for gradual rollout (MVC: DevExp Customer0 projects on hosted runners)
- No sampling configuration is passed per-job — sampling is handled entirely at the OTEL Collector level
- The OTEL Collector endpoint is not passed per-job — it is statically configured on each runner manager
- Coordinate with ~"group::runner core" (gitlab-org/gitlab-runner#39231+)
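
The negotiation described in the bullets above can be pictured with a small sketch. It is illustrative only: the real implementation lives in Rails, and `FeatureFlags`, `job_payload_features`, and the `features` key are hypothetical names, not the agreed job payload schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureFlags:
    """Hypothetical stand-in for a project-level feature flag store."""

    enabled_project_ids: frozenset = frozenset()

    def enabled_for(self, project_id: int) -> bool:
        return project_id in self.enabled_project_ids


def job_payload_features(project_id: int, flags: FeatureFlags) -> dict:
    """Build the feature section of a job payload response (assumed shape)."""
    # Only availability is reported. Sampling configuration and the collector
    # endpoint are deliberately absent: per the scope above, they are handled
    # at the OTEL Collector and on the runner manager respectively.
    return {"features": {"job_telemetry": flags.enabled_for(project_id)}}
```

Keeping the payload to a single boolean means rollout can be widened by changing the flag alone, with no runner-side configuration change.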
### 2. Trace Context Initialization — ~3-4 days
gitlab-org/gitlab#590587+
- Rails generates `trace_id` deterministically from `job_id` (ensures all spans for a job share the same trace)
- Pass trace context to Runner in the job payload response
- Coordinate with ~"group::runner core" (&20633) on the job payload schema
Reference: [Multi-source trace context coordination](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#multi-source-trace-context-coordination)
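
One way to derive a deterministic `trace_id` is to hash the job ID and keep the first 16 bytes, which matches the 32-hex-character trace ID length used by W3C Trace Context and OpenTelemetry. The sketch below is an assumption for illustration: the hash function, the `namespace` prefix, and the function name are hypothetical, and the design document linked above remains the source of truth for the actual derivation.

```python
import hashlib


def trace_id_for_job(job_id: int, namespace: str = "ci-job-telemetry") -> str:
    """Derive a 128-bit trace ID (32 lowercase hex chars) from a job ID.

    The same job_id always yields the same trace ID, so any component that
    knows the job ID can attach its spans to the same trace.
    """
    # Assumption: SHA-256 over "<namespace>:<job_id>"; not the confirmed scheme.
    digest = hashlib.sha256(f"{namespace}:{job_id}".encode("utf-8")).digest()
    trace_id = digest[:16].hex()
    # An all-zero trace ID is invalid in W3C Trace Context. A hash collision
    # with it is practically impossible, but the guard keeps the contract explicit.
    if trace_id == "0" * 32:
        raise ValueError("derived an invalid all-zero trace ID")
    return trace_id
```

Calling `trace_id_for_job(42)` twice returns the same value, while a different job ID returns a different one.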
### 3. CI Telemetry Materialized View — ~1 week
gitlab-org/gitlab#590586+
Create a `ci_job_telemetry_traces` Materialized View on the Observability team's ClickHouse instance that:
- Reads from the `otel_traces` table (auto-created by the OTEL Collector ClickHouse exporter)
- Filters by `ServiceName = 'ci-job-telemetry'`
- Denormalizes CI-specific attributes from `SpanAttributes` into typed columns (project_id, job_id, pipeline_id, runner_id, etc.)
- Supports efficient queries by job_id, project_id, and time range
Reference: [ClickHouse schema section](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#clickhouse-schema)
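
To make the intended transformation concrete, the sketch below mimics in plain Python what the Materialized View would do in ClickHouse: filter on `ServiceName`, then lift string-valued entries of `SpanAttributes` into typed columns. The attribute keys (`ci.job.id` and so on) and the helper names are hypothetical; the real keys come from the Runner instrumentation and the schema section referenced above.

```python
from typing import Dict, Iterable, Iterator, Optional

SERVICE_NAME = "ci-job-telemetry"

# Hypothetical mapping of typed column name -> span attribute key.
ATTRIBUTE_COLUMNS: Dict[str, str] = {
    "project_id": "ci.project.id",
    "job_id": "ci.job.id",
    "pipeline_id": "ci.pipeline.id",
    "runner_id": "ci.runner.id",
}


def to_int_or_none(raw: Optional[str]) -> Optional[int]:
    """Cast a string attribute to int; None when missing or malformed."""
    try:
        return int(raw) if raw is not None else None
    except ValueError:
        return None


def denormalize(spans: Iterable[dict]) -> Iterator[dict]:
    """Yield one typed row per CI telemetry span, skipping other services."""
    for span in spans:
        if span.get("ServiceName") != SERVICE_NAME:
            continue
        attributes = span.get("SpanAttributes", {})
        row = {
            "Timestamp": span["Timestamp"],
            "TraceId": span["TraceId"],
            "SpanName": span["SpanName"],
        }
        for column, key in ATTRIBUTE_COLUMNS.items():
            row[column] = to_int_or_none(attributes.get(key))
        yield row
```

Doing the cast once at write time is what lets later queries filter on `job_id` or `project_id` without parsing the attribute map per row.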
**Total MVC estimate: ~2-2.5 weeks** across the three items above. Issues #590588 (feature negotiation) and #590587 (trace context initialization) are prerequisites for Runner Phase 1 (gitlab-org/gitlab-runner#39231+) and should land first.
## Post-MVC Scope
### Rails Query Layer
gitlab-org/gitlab#590589+
- GraphQL types/resolvers to expose CI telemetry traces
- Query by job_id, project_id, time range
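
The three query dimensions listed above could translate into a parameterized ClickHouse query along these lines. This is a sketch under assumptions, not the planned Rails/GraphQL implementation: the function name is hypothetical, and the column names (`job_id`, `project_id`, `Timestamp`) assume the Materialized View keeps the exporter's `Timestamp` column and the typed columns named in the MVC scope. The `{name:Type}` placeholders are ClickHouse's query parameter syntax, which keeps values out of the SQL text.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_trace_query(
    *,
    job_id: Optional[int] = None,
    project_id: Optional[int] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (sql, params) filtering by job, project, and/or time range."""
    conditions = []
    params: Dict[str, str] = {}
    if job_id is not None:
        conditions.append("job_id = {job_id:UInt64}")
        params["job_id"] = str(job_id)
    if project_id is not None:
        conditions.append("project_id = {project_id:UInt64}")
        params["project_id"] = str(project_id)
    if start is not None:
        conditions.append("Timestamp >= {start:DateTime}")
        params["start"] = start.strftime(TIME_FORMAT)
    if end is not None:
        conditions.append("Timestamp < {end:DateTime}")
        params["end"] = end.strftime(TIME_FORMAT)
    if not conditions:
        # Refuse unfiltered reads so a caller cannot scan the whole table.
        raise ValueError("at least one filter is required")
    sql = (
        "SELECT * FROM ci_job_telemetry_traces WHERE "
        + " AND ".join(conditions)
        + " ORDER BY Timestamp"
    )
    return sql, params
```

For example, `build_trace_query(job_id=42, start=datetime(2025, 1, 1))` yields a query with two conditions and the parameters `{"job_id": "42", "start": "2025-01-01 00:00:00"}`.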
### Rails Auth Endpoint (Phase 4a)
gitlab-org/gitlab#589219+
- `POST /api/v4/internal/ci/telemetry/auth` — validates runner/job tokens for the auth gateway
### Aggregated Metrics (Phase 5)
- Hourly/daily Materialized Views for alerting and dashboards (cache hit rates, p50/p95 durations)
- Reference: [Phase 5 section](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#phase-5-aggregated-metrics-via-materialized-views-for-alerting)
### AI/Duo Integration (Phase 8)
- GLQL tool (`run_glql_query`) for Duo AI to query CI telemetry
- Reference: [Phase 8 section](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#phase-8-aiduo-integration)
## Prior Art
The Observability team's [Mimir tracing rollout](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4416#note_3104250219) validated the OTEL → ClickHouse pipeline at production scale:
- 1% head sampling on Mimir (~700k RPS) with zero performance impact on the instrumented service
- OTEL Gateway: ~0.3–0.5 CPU cores, ~300–450 MiB per pod — lightweight and stable
- ClickHouse (smallest cloud instance): ~12k–16k spans/sec, 5.15B spans in 266 GiB (~56 bytes/span compressed), 1–2 sec query latency
These results give us confidence in the ClickHouse + OTEL Collector pipeline approach for CI job telemetry.
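
As a back-of-envelope check on the storage figures above, the per-span footprint and the implied daily volume at the quoted ingest rates can be computed directly. The daily extrapolation is an assumption added here for sizing intuition only (it presumes CI spans compress similarly to the Mimir spans), not a number from the rollout notes.

```python
GIB = 2**30
SECONDS_PER_DAY = 86_400

total_spans = 5.15e9       # spans stored, from the rollout notes
stored_bytes = 266 * GIB   # compressed size on disk, from the rollout notes

# Average compressed footprint per span; lands close to the quoted ~56 bytes.
bytes_per_span = stored_bytes / total_spans


def daily_storage_gib(spans_per_second: float) -> float:
    """Compressed GiB written per day at a sustained ingest rate."""
    return spans_per_second * SECONDS_PER_DAY * bytes_per_span / GIB


low = daily_storage_gib(12_000)
high = daily_storage_gib(16_000)
print(f"{bytes_per_span:.1f} bytes/span, {low:.0f} to {high:.0f} GiB/day")
```

At the quoted 12k to 16k spans/sec this works out to a few tens of GiB per day, which is a useful baseline when estimating retention for CI telemetry.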
## Dependencies
| Dependency | Team | Status |
|------------|------|--------|
| OTEL Collector + ClickHouse instance | ~"group::Observability" | In progress ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102)) |
| Runner OTLP span emission | ~"group::runner core" (&20633) | Pending |
| `otel_traces` table creation | OTEL Collector ClickHouse exporter (automatic) | Automatic on first span |