CI Job Telemetry - CI Platform (MV, Query Layer, UI)
## Overview

This epic tracks CI Platform's work for CI Job Telemetry Reporting: feature negotiation, trace context initialization, the ClickHouse Materialized View, and the Rails query layer.

## Parent Epic

&20632

## Architecture Reference

<https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/>

## MVC Scope

### 1. Feature Negotiation (Rails side): ~1-2 days

gitlab-org/gitlab#590588+

- Report `job_telemetry` feature availability (enabled yes/no) in the job payload response
- Controlled by a **project-level feature flag** for gradual rollout (MVC: DevExp Customer0 projects on hosted runners)
- No sampling configuration is passed per-job; sampling is handled entirely at the OTEL Collector level
- The OTEL Collector endpoint is not passed per-job; it is statically configured on each runner manager
- Coordinate with ~"group::runner core" (gitlab-org/gitlab-runner#39231+)

### 2. Trace Context Initialization: ~3-4 days

gitlab-org/gitlab#590587+

- Rails generates `trace_id` deterministically from `job_id` (ensures all spans for a job share the same trace)
- Pass trace context to Runner in the job payload response
- Coordinate with ~"group::runner core" (&20633) on the job payload schema

Reference: [Multi-source trace context coordination](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#multi-source-trace-context-coordination)

### 3. CI Telemetry Materialized View: ~1 week

gitlab-org/gitlab#590586+

Create a `ci_job_telemetry_traces` Materialized View on the Observability team's ClickHouse instance that:

- Reads from the `otel_traces` table (auto-created by the OTEL Collector ClickHouse exporter)
- Filters by `ServiceName = 'ci-job-telemetry'`
- Denormalizes CI-specific attributes from `SpanAttributes` into typed columns (project_id, job_id, pipeline_id, runner_id, etc.)
- Supports efficient queries by job_id, project_id, and time range

Reference: [ClickHouse schema section](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#clickhouse-schema)

**Total MVC estimate: ~2-2.5 weeks**. Issues #590588 and #590587 are prerequisites for Runner Phase 1 (gitlab-org/gitlab-runner#39231+) and should land first.

## Post-MVC Scope

### Rails Query Layer

gitlab-org/gitlab#590589+

- GraphQL types/resolvers to expose CI telemetry traces
- Query by job_id, project_id, time range

### Rails Auth Endpoint (Phase 4a)

gitlab-org/gitlab#589219+

- `POST /api/v4/internal/ci/telemetry/auth`: validates runner/job tokens for the auth gateway

### Aggregated Metrics (Phase 5)

- Hourly/daily Materialized Views for alerting and dashboards (cache hit rates, p50/p95 durations)
- Reference: [Phase 5 section](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#phase-5-aggregated-metrics-via-materialized-views-for-alerting)

### AI/Duo Integration (Phase 8)

- GLQL tool (`run_glql_query`) for Duo AI to query CI telemetry
- Reference: [Phase 8 section](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#phase-8-aiduo-integration)

## Prior Art

The Observability team's [Mimir tracing rollout](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4416#note_3104250219) validated the OTEL → ClickHouse pipeline at production scale:

- 1% head sampling on Mimir (~700k RPS) with zero performance impact on the instrumented service
- OTEL Gateway: ~0.3–0.5 CPU cores, ~300–450 MiB per pod; lightweight and stable
- ClickHouse (smallest cloud instance): ~12k–16k spans/sec, 5.15B spans in 266 GiB (~56 bytes/span compressed), 1–2 sec query latency

This provides a strong confidence signal for our ClickHouse + OTEL Collector pipeline approach.

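### Illustrative sketch: deterministic trace context

To make the "Trace Context Initialization" item in the MVC scope concrete, here is a minimal Python sketch of deriving a stable `trace_id` from a `job_id` and wrapping it in a W3C `traceparent` value. Everything specific in it is an assumption for illustration: this epic does not state the actual derivation, so the SHA-256 truncation, the `gitlab-ci-job:` namespace string, and the function names `trace_id_for_job` and `traceparent_for_job` are hypothetical. The real implementation would live in Rails and follow the design document.

```python
import hashlib
import secrets


def trace_id_for_job(job_id: int) -> str:
    """Derive a 128-bit trace ID (32 lowercase hex chars) from a job ID.

    Assumption: SHA-256 over a namespaced string, truncated to 16 bytes.
    The actual scheme is defined by the design document, not by this sketch.
    """
    digest = hashlib.sha256(f"gitlab-ci-job:{job_id}".encode("utf-8")).digest()
    return digest[:16].hex()


def traceparent_for_job(job_id: int) -> str:
    """Build a W3C traceparent value: version-traceid-parentid-flags."""
    span_id = secrets.token_hex(8)  # random 64-bit span ID for this emitter
    flags = "01"  # sampled bit set; sampling itself happens at the OTEL Collector
    return f"00-{trace_id_for_job(job_id)}-{span_id}-{flags}"
```

Because the trace ID depends only on the job ID, any component that knows the job ID can compute the same value independently, which is what lets spans from different sources join one trace.
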
## Dependencies

| Dependency | Team | Status |
|------------|------|--------|
| OTEL Collector + ClickHouse instance | ~"group::Observability" | In progress ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102)) |
| Runner OTLP span emission | ~"group::runner core" (&20633) | Pending |
| `otel_traces` table creation | OTEL Collector ClickHouse exporter (automatic) | Automatic on first span |
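
### Illustrative sketch: what the Materialized View does to an `otel_traces` row

The `otel_traces` table listed above feeds the `ci_job_telemetry_traces` Materialized View from the MVC scope. As a rough mental model, the Python sketch below mimics the view's per-row logic: keep only spans whose `ServiceName` is `ci-job-telemetry`, then lift selected `SpanAttributes` entries into typed integer columns. The attribute keys (`ci.job.id` and so on) and the helper names are hypothetical; the real keys and column types come from the ClickHouse schema section of the design document. Falling back to `0` for missing or malformed values is also an assumption, similar in spirit to ClickHouse's `toUInt64OrZero`.

```python
from typing import Optional

CI_SERVICE_NAME = "ci-job-telemetry"

# Hypothetical mapping of typed column name -> SpanAttributes key.
TYPED_COLUMNS = {
    "project_id": "ci.project.id",
    "job_id": "ci.job.id",
    "pipeline_id": "ci.pipeline.id",
    "runner_id": "ci.runner.id",
}


def to_uint_or_zero(value: Optional[str]) -> int:
    """Parse an unsigned integer string; return 0 if missing or malformed."""
    if value is not None and value.isascii() and value.isdigit():
        return int(value)
    return 0


def denormalize_span(row: dict) -> Optional[dict]:
    """Return the typed row the view would store, or None if filtered out."""
    if row.get("ServiceName") != CI_SERVICE_NAME:
        return None
    attrs = row.get("SpanAttributes", {})
    out = {"TraceId": row.get("TraceId"), "SpanName": row.get("SpanName")}
    for column, key in TYPED_COLUMNS.items():
        out[column] = to_uint_or_zero(attrs.get(key))
    return out
```

Typed columns are what make lookups by job_id, project_id, and time range cheap, since the query no longer has to probe a string map for every span.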