CI Job Telemetry - CI Platform
## Overview
This epic tracks CI Platform's work for CI Job Telemetry Reporting — feature negotiation, trace context initialization, application settings, and Rails job lifecycle spans for the MVC, plus the ClickHouse Materialized View and Rails query layer for post-MVC.
## Parent Epic
&20632
## Architecture Reference
<https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/>
## MVC Scope (Phase 1)
### 1. Feature Negotiation (Rails side) — ✅ Closed
gitlab-org/gitlab#590588+
- Include a `features.tracing` object in the job payload response (presence = enabled and sampled, absence = disabled) containing `trace_id`, `span_parent_id`, and `otel_endpoints`
- `otel_endpoints` is an array of objects (single entry for MVC) carrying a `url` and optional typed `auth` configuration ([endpoint auth schema](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#endpoint-auth-schema)). The single MVC entry is GitLab's Collector (Rails application setting). BYO OTLP destinations are deferred to [future work](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#future-work-byo-otlp-endpoints).
- Controlled by a **project-level feature flag** for gradual rollout (MVC: DevExp Customer0 projects on hosted runners)
- **Sampling** is handled at the Rails level: a global application setting (`ci_job_telemetry_sampling_rate`) combined with a deterministic hash of the root pipeline ID controls which pipelines are instrumented. No sampling configuration is passed per-job — `features.tracing` is only included for sampled pipelines.
- Coordinate with ~"group::runner core" (gitlab-org/gitlab-runner#39231+)
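
As an illustration of the payload shape described above, the following is a minimal sketch rather than the actual Rails implementation. The helper name `build_job_features` and its parameters are hypothetical. Only the field names `tracing`, `trace_id`, `span_parent_id`, `otel_endpoints`, `url`, and `auth` come from this epic. The optional `auth` object is omitted because its schema is defined in the linked design document.

```python
import json
from typing import Optional


def build_job_features(
    trace_id: str,
    span_parent_id: str,
    collector_url: Optional[str],
    sampled: bool,
) -> dict:
    """Build the `features` object of a job payload (hypothetical helper)."""
    features: dict = {}
    # Absence of `tracing` signals "disabled or not sampled" to the Runner.
    if not sampled or not collector_url:
        return features
    features["tracing"] = {
        "trace_id": trace_id,
        "span_parent_id": span_parent_id,
        # Single entry for MVC; the optional typed `auth` object is left out here.
        "otel_endpoints": [{"url": collector_url}],
    }
    return features


if __name__ == "__main__":
    # Example values are made up for illustration only.
    payload = build_job_features(
        trace_id="0" * 31 + "1",
        span_parent_id="0" * 15 + "1",
        collector_url="https://collector.example.com/v1/traces",
        sampled=True,
    )
    print(json.dumps(payload, indent=2))
```

Signalling by presence keeps the Runner side simple: it only needs to check whether `features.tracing` exists, with no separate enabled flag or sampling rate to interpret.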
### 1b. OTEL Collector Endpoint Application Setting — ✅ Closed
gitlab-org/gitlab#591941+
- Add instance-level application setting `ci_telemetry_otel_endpoint` (string, nullable) for the primary OTEL Collector OTLP/HTTP endpoint URL
- When set, Rails includes the URL as the entry in `features.tracing.otel_endpoints` in job payloads
- When blank/nil, `features.tracing` is not included (telemetry disabled)
- For GitLab.com, infrastructure configures this to point to the Observability team's OTEL Collector
- BYO OTLP destinations (additional customer-configured endpoints) are deferred to [future work](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#future-work-byo-otlp-endpoints) — not part of MVC
### 1c. CI Telemetry Sampling Rate Application Setting — ✅ Closed
gitlab-org/gitlab#593834+
- Add global application setting `ci_job_telemetry_sampling_rate` (float, 0.0–1.0, default 0.0) controlling what fraction of pipelines in enabled projects are instrumented
- Applied deterministically per root pipeline ID (all jobs in a pipeline hierarchy get the same sampling decision)
- `features.tracing` is included only when: feature flag enabled for project **AND** Collector endpoint configured **AND** pipeline is sampled
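
The gating rule above can be sketched as follows. This is an assumption-laden illustration: the epic specifies only "a deterministic hash of the root pipeline ID", so the choice of SHA-256, the bucket mapping, and the function names `is_pipeline_sampled` and `tracing_enabled` are all hypothetical.

```python
import hashlib


def is_pipeline_sampled(root_pipeline_id: int, sampling_rate: float) -> bool:
    """Deterministic sampling decision per root pipeline (hash choice is an assumption)."""
    if sampling_rate <= 0.0:
        return False
    if sampling_rate >= 1.0:
        return True
    digest = hashlib.sha256(str(root_pipeline_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a bucket in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


def tracing_enabled(
    flag_enabled: bool,
    collector_endpoint: str,
    root_pipeline_id: int,
    sampling_rate: float,
) -> bool:
    """All three conditions must hold for `features.tracing` to be included."""
    return bool(
        flag_enabled
        and collector_endpoint
        and is_pipeline_sampled(root_pipeline_id, sampling_rate)
    )
```

Because the decision depends only on the root pipeline ID and the rate, every job in a parent/child pipeline hierarchy gets the same answer without any shared state.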
### 2. Trace Context Initialization — ✅ Closed
gitlab-org/gitlab#590587+
- Rails generates `trace_id` deterministically from the **root** `pipeline_id` (ensures all jobs across parent and child pipelines share the same trace)
- Rails includes `span_parent_id` in `features.tracing` referencing the Rails `job_running` span ID, so the Runner parents its `job_execution` span under `job_running` (hierarchy: `job_lifecycle` → `job_running` → `job_execution`)
- For child pipeline jobs, the Rails `job_lifecycle` span itself is a child of the trigger (bridge) job's span — this nesting is handled internally by Rails span emission, not via `span_parent_id` in the payload
- Trace context (`trace_id` + `span_parent_id`) is included inside `features.tracing` in the job payload response — no separate `trace_context` field
- Coordinate with ~"group::runner core" (&20633) on the job payload schema
Reference: [Multi-source trace context coordination](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#multi-source-trace-context-coordination)
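
One possible way to derive a deterministic `trace_id` is sketched below. The epic does not state the derivation, so the hash function and the `ci-pipeline:` input prefix are assumptions, and `trace_id_for_root_pipeline` is a hypothetical name. The only fixed constraint used is that an OpenTelemetry trace ID is 16 bytes (32 lowercase hex characters).

```python
import hashlib


def trace_id_for_root_pipeline(root_pipeline_id: int) -> str:
    """Derive a 16-byte trace ID (32 hex chars) from the root pipeline ID.

    Hash function and input prefix are illustrative assumptions.
    """
    digest = hashlib.sha256(f"ci-pipeline:{root_pipeline_id}".encode("utf-8")).digest()
    return digest[:16].hex()
```

Any component that knows the root pipeline ID can recompute the same trace ID, which is what lets jobs in parent and child pipelines join one trace without passing extra state around.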
### 3. Rails Job Lifecycle Spans — In development
gitlab-org/gitlab#596774+
- Rails emits `job_lifecycle`, `job_pending`, and `job_running` spans covering the full job state machine (`created` → `pending` → `running` → `finished`)
- Provides end-to-end visibility including Sidekiq/`PipelineProcessWorker` delays that are invisible to the Runner
- Covers bridge jobs and external jobs out of the box
- The `job_running` span's `span_id` is set as `span_parent_id` in `features.tracing` so the Runner parents its `job_execution` span under it
- Owned by ~"group::ci platform" (@narendran-kannan)
Reference: [Rails integration workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-rails-integration)
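
The relationship between job state transitions and the three Rails spans can be sketched as below. This is a simplified model, not the Rails implementation: the `Span` class, the `lifecycle_spans` function, and its timestamp parameters are hypothetical, and placing `job_pending` directly under `job_lifecycle` is an assumption (the epic states the parent only for `job_running`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]  # name of the parent span, None for the root of this job
    start: float           # seconds since an arbitrary epoch
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def lifecycle_spans(
    created_at: float, pending_at: float, running_at: float, finished_at: float
) -> List[Span]:
    """Map state-transition timestamps to the three Rails spans."""
    return [
        # Covers the whole state machine: created -> finished.
        Span("job_lifecycle", None, created_at, finished_at),
        # Time spent waiting to be picked up: pending -> running.
        Span("job_pending", "job_lifecycle", pending_at, running_at),
        # The Runner's `job_execution` span is parented under this one.
        Span("job_running", "job_lifecycle", running_at, finished_at),
    ]
```

In this model, the gap between `created_at` and `pending_at` is visible only in `job_lifecycle`, which is where delays before the job reaches `pending` would show up.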
### 4. Feature Flag Rollout — Rolling out
gitlab-org/gitlab#590939+
- Roll out the `ci_job_telemetry` feature flag starting with DevExp Customer0 projects on GitLab.com hosted runners
- Combined with `ci_job_telemetry_sampling_rate` to control overall span volume
**Progress**: All four foundational issues (#590588, #590587, #591941, #593834) were closed in Apr 2026. Rails lifecycle spans (#596774) are in development. Runner basic instrumentation (gitlab-org/gitlab-runner#39231) was merged in Apr 2026.
**Note on ClickHouse**: MVC needs no ClickHouse migrations. The OTEL Collector's ClickHouse exporter creates the `otel_traces` table automatically on the Observability team's ClickHouse (CH) instance. For MVC, DevExp queries `otel_traces` directly via Grafana. If DevExp needs Materialized Views for Grafana query performance, DevExp owns and manages those MVs on the Observability CH instance.
## Post-MVC Scope
### Rails Auth Endpoint (Phase 2 — [Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners))
gitlab-org/gitlab#589219+
- `POST /api/v4/internal/ci/telemetry/auth` — validates runner/job tokens for the auth gateway
- Tracked in Phase 2 epic: gitlab-org&21683
### CI Telemetry Materialized View (Phase 3 — [Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users))
gitlab-org/gitlab#590586+
Create the **product-facing** `ci_job_telemetry_traces` Materialized View on the **production CH instance** (not the Observability CH instance). This MV serves the Rails query layer (gitlab-org/gitlab#590589) for customer-facing features (GraphQL, GLQL, Duo). Deferred until query patterns are established from MVC Grafana dashboard usage.
- Reads from the `otel_traces` table
- Filters by `ServiceName IN ('gitlab-ci-runner', 'gitlab-ci-job-router', 'gitlab-ci-rails')`
- Denormalizes CI-specific attributes from `SpanAttributes` into typed columns (project_id, job_id, pipeline_id, runner_id, etc.)
Reference: [ClickHouse schema section](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#clickhouse-schema)
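
The filter and denormalization the Materialized View would perform can be modelled per row as below. This is a Python sketch of the transformation, not the ClickHouse DDL. The function name `denormalize` is hypothetical, and the attribute keys inside `SpanAttributes` are assumed to match the target column names, which the epic does not confirm.

```python
from typing import Optional

CI_SERVICES = {"gitlab-ci-runner", "gitlab-ci-job-router", "gitlab-ci-rails"}

# Assumed attribute keys; the real SpanAttributes keys may differ.
TYPED_COLUMNS = ("project_id", "job_id", "pipeline_id", "runner_id")


def denormalize(row: dict) -> Optional[dict]:
    """Return a typed row for CI spans, or None if the span is filtered out."""
    if row.get("ServiceName") not in CI_SERVICES:
        return None
    attrs = row.get("SpanAttributes", {})
    out = {
        "TraceId": row.get("TraceId"),
        "SpanId": row.get("SpanId"),
        "SpanName": row.get("SpanName"),
        "ServiceName": row["ServiceName"],
    }
    for column in TYPED_COLUMNS:
        # Attribute values arrive as strings; cast them to integers.
        out[column] = int(attrs[column]) if column in attrs else None
    return out
```

Typed columns let the Rails query layer filter on integer IDs directly instead of doing map lookups on every span.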
### Rails Query Layer (Phase 3 — [Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users))
gitlab-org/gitlab#590589+
- GraphQL types/resolvers to expose CI telemetry traces
- Query by job_id, project_id, time range
- **Note**: Customer-facing queries hit the **production CH instance** (not the Observability CH instance, which is internal-only). The OTEL Collector's production-CH exporter pipeline writes a filtered/sampled subset for this purpose.
### Data Consumable by Users (Phase 3 — [Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users))
- GraphQL API and GLQL integration for customer-facing CI telemetry queries
- Duo AI/DAP integration through the existing `run_glql_query` tool
### Alerting (Phase 3 — [Alerting workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-alerting))
- Aggregated metrics MVs for alerting and dashboards (cache hit rates, p50/p95 durations)
- Baseline alerting on metric deviations
## Prior Art
The Observability team's [Mimir tracing rollout](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4416#note_3104250219) validated the OTEL → ClickHouse pipeline at production scale:
- 1% head sampling on Mimir (~700k RPS) with zero performance impact on the instrumented service
- OTEL Gateway: ~0.3–0.5 CPU cores, ~300–450 MiB per pod — lightweight and stable
- ClickHouse (smallest cloud instance): ~12k–16k spans/sec, 5.15B spans in 266 GiB (~56 bytes/span compressed), 1–2 sec query latency
These results give us strong confidence in the OTEL Collector → ClickHouse pipeline approach.
## Dependencies
| Dependency | Team | Status |
|------------|------|--------|
| OTEL Collector + ClickHouse instance | ~"group::Observability" | ✅ ClickHouse instance enabled ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102) — merged 2026-02-12) |
| Runner OTLP span emission | ~"group::runner core" (&20633) | Basic instrumentation merged (#39231 — Apr 2026); stage spans pending |
| `otel_traces` table creation | OTEL Collector ClickHouse exporter (automatic) | Automatic on first span |
| LabKit OTEL integration for Rails | ~"group::Observability" | ✅ [labkit-ruby!228](https://gitlab.com/gitlab-org/ruby/gems/labkit-ruby/-/merge_requests/228) merged 2026-02-13 |