CI Job Telemetry - CI Platform
## Overview

This epic tracks CI Platform's work for CI Job Telemetry Reporting: feature negotiation, trace context initialization, application settings, and Rails job lifecycle spans for the MVC, plus the ClickHouse Materialized View and Rails query layer for post-MVC.

## Parent Epic

&20632

## Architecture Reference

<https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/>

## MVC Scope (Phase 1)

### 1. Feature Negotiation (Rails side): ✅ Closed

gitlab-org/gitlab#590588+

- Include a `features.tracing` object in the job payload response (presence = enabled and sampled, absence = disabled) containing `trace_id`, `span_parent_id`, and `otel_endpoints`
- `otel_endpoints` is an array of objects (single entry for MVC) carrying a `url` and optional typed `auth` configuration ([endpoint auth schema](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#endpoint-auth-schema)). The single MVC entry is GitLab's Collector (Rails application setting). BYO OTLP destinations are deferred to [future work](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#future-work-byo-otlp-endpoints).
- Controlled by a **project-level feature flag** for gradual rollout (MVC: DevExp Customer0 projects on hosted runners)
- **Sampling** is handled at the Rails level: a global application setting (`ci_job_telemetry_sampling_rate`) combined with a deterministic hash of the root pipeline ID controls which pipelines are instrumented. No sampling configuration is passed per-job; `features.tracing` is only included for sampled pipelines.
- Coordinate with ~"group::runner core" (gitlab-org/gitlab-runner#39231+)

### 1b. OTEL Collector Endpoint Application Setting: ✅ Closed

gitlab-org/gitlab#591941+

- Add instance-level application setting `ci_telemetry_otel_endpoint` (string, nullable) for the primary OTEL Collector OTLP/HTTP endpoint URL
- When set, Rails includes the URL as the entry in `features.tracing.otel_endpoints` in job payloads
- When blank/nil, `features.tracing` is not included (telemetry disabled)
- For GitLab.com, infrastructure configures this to point to the Observability team's OTEL Collector
- BYO OTLP destinations (additional customer-configured endpoints) are deferred to [future work](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#future-work-byo-otlp-endpoints) (not part of MVC)

### 1c. CI Telemetry Sampling Rate Application Setting: ✅ Closed

gitlab-org/gitlab#593834+

- Add global application setting `ci_job_telemetry_sampling_rate` (float, 0.0–1.0, default 0.0) controlling what fraction of pipelines in enabled projects are instrumented
- Applied deterministically per root pipeline ID (all jobs in a pipeline hierarchy get the same sampling decision)
- `features.tracing` is included only when: feature flag enabled for project **AND** Collector endpoint configured **AND** pipeline is sampled

### 2. Trace Context Initialization: ✅ Closed

gitlab-org/gitlab#590587+

- Rails generates `trace_id` deterministically from the **root** `pipeline_id` (ensures all jobs across parent and child pipelines share the same trace)
- Rails includes `span_parent_id` in `features.tracing` referencing the Rails `job_running` span ID, so the Runner parents its `job_execution` span under `job_running` (hierarchy: `job_lifecycle` → `job_running` → `job_execution`)
- For child pipeline jobs, the Rails `job_lifecycle` span itself is a child of the trigger (bridge) job's span; this nesting is handled internally by Rails span emission, not via `span_parent_id` in the payload
- Trace context (`trace_id` + `span_parent_id`) is included inside `features.tracing` in the job payload response; no separate `trace_context` field
- Coordinate with ~"group::runner core" (&20633) on the job payload schema

Reference: [Multi-source trace context coordination](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#multi-source-trace-context-coordination)

### 3. Rails Job Lifecycle Spans: In development

gitlab-org/gitlab#596774+

- Rails emits `job_lifecycle`, `job_pending`, and `job_running` spans covering the full job state machine (`created` → `pending` → `running` → `finished`)
- Provides end-to-end visibility including Sidekiq/`PipelineProcessWorker` delays that are invisible to the Runner
- Covers bridge jobs and external jobs out-of-the-box
- The `job_running` span's `span_id` is set as `span_parent_id` in `features.tracing` so the Runner parents its `job_execution` span under it
- Owned by ~"group::ci platform" (@narendran-kannan)

Reference: [Rails integration workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-rails-integration)

### 4. Feature Flag Rollout: Rolling out

gitlab-org/gitlab#590939+

- Roll out the `ci_job_telemetry` feature flag starting with DevExp Customer0 projects on GitLab.com hosted runners
- Combined with `ci_job_telemetry_sampling_rate` to control overall span volume

**Progress**: All four foundational issues (#590588, #590587, #591941, #593834) closed in Apr 2026. Rails lifecycle spans (#596774) in development. Runner basic instrumentation (#39231) merged Apr 2026.

**Note on ClickHouse**: No ClickHouse migrations are needed for MVC. The `otel_traces` table is auto-created by the OTEL Collector ClickHouse exporter on the Observability team's instance. For MVC, DevExp queries `otel_traces` directly via Grafana. If DevExp needs Materialized Views for Grafana query performance, they own and manage those MVs on the Observability CH instance themselves.

## Post-MVC Scope

### Rails Auth Endpoint (Phase 2, [Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners))

gitlab-org/gitlab#589219+

- `POST /api/v4/internal/ci/telemetry/auth`: validates runner/job tokens for the auth gateway
- Tracked in Phase 2 epic: gitlab-org&21683

### CI Telemetry Materialized View (Phase 3, [Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users))

gitlab-org/gitlab#590586+

Create the **product-facing** `ci_job_telemetry_traces` Materialized View on the **production CH instance** (not the Observability CH instance). This MV serves the Rails query layer (gitlab-org/gitlab#590589) for customer-facing features (GraphQL, GLQL, Duo). Deferred until query patterns are established from MVC Grafana dashboard usage.

- Reads from the `otel_traces` table
- Filters by `ServiceName IN ('gitlab-ci-runner', 'gitlab-ci-job-router', 'gitlab-ci-rails')`
- Denormalizes CI-specific attributes from `SpanAttributes` into typed columns (project_id, job_id, pipeline_id, runner_id, etc.)

Reference: [ClickHouse schema section](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#clickhouse-schema)

### Rails Query Layer (Phase 3, [Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users))

gitlab-org/gitlab#590589+

- GraphQL types/resolvers to expose CI telemetry traces
- Query by job_id, project_id, time range
- **Note**: Customer-facing queries hit the **production CH instance** (not the Observability CH instance, which is internal-only). The OTEL Collector's production-CH exporter pipeline writes a filtered/sampled subset for this purpose.

### Data Consumable by Users (Phase 3, [Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users))

- GraphQL API and GLQL integration for customer-facing CI telemetry queries
- Duo AI/DAP integration through existing `run_glql_query` tool

### Alerting (Phase 3, [Alerting workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-alerting))

- Aggregated metrics MVs for alerting and dashboards (cache hit rates, p50/p95 durations)
- Baseline alerting on metric deviations

## Prior Art

The Observability team's [Mimir tracing rollout](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4416#note_3104250219) validated the OTEL → ClickHouse pipeline at production scale:

- 1% head sampling on Mimir (~700k RPS) with zero performance impact on the instrumented service
- OTEL Gateway: ~0.3–0.5 CPU cores, ~300–450 MiB per pod (lightweight and stable)
- ClickHouse (smallest cloud instance): ~12k–16k spans/sec, 5.15B spans in 266 GiB (~56 bytes/span compressed), 1–2 sec query latency

This provides a strong confidence signal for our ClickHouse + OTEL Collector pipeline approach.

## Dependencies

| Dependency | Team | Status |
|------------|------|--------|
| OTEL Collector + ClickHouse instance | ~"group::Observability" | ✅ ClickHouse instance enabled ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102), merged 2026-02-12) |
| Runner OTLP span emission | ~"group::runner core" (&20633) | Basic instrumentation merged (#39231, Apr 2026); stage spans pending |
| `otel_traces` table creation | OTEL Collector ClickHouse exporter (automatic) | Automatic on first span |
| LabKit OTEL integration for Rails | ~"group::Observability" | ✅ [labkit-ruby!228](https://gitlab.com/gitlab-org/ruby/gems/labkit-ruby/-/merge_requests/228) merged 2026-02-13 |
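
For readers who want to see how the MVC gating rules in sections 1 to 2 fit together, here is a minimal Python sketch. It is illustrative only and is not the Rails implementation. The function names, the SHA-256 hashing scheme, and the `ci-pipeline:` / `ci-sampling:` key prefixes are assumptions made for this example; the epic only specifies that both the sampling decision and `trace_id` are derived deterministically from the root pipeline ID. The 32-hex-character `trace_id` length follows the OpenTelemetry 16-byte trace ID format.

```python
import hashlib
from typing import Optional


def derive_trace_id(root_pipeline_id: int) -> str:
    """Hypothetical derivation: a 128-bit (32 hex char) trace ID from the root pipeline ID."""
    digest = hashlib.sha256(f"ci-pipeline:{root_pipeline_id}".encode()).digest()
    return digest[:16].hex()


def is_sampled(root_pipeline_id: int, sampling_rate: float) -> bool:
    """Deterministic sampling: the same root pipeline always gets the same decision."""
    if sampling_rate <= 0.0:
        return False
    if sampling_rate >= 1.0:
        return True
    digest = hashlib.sha256(f"ci-sampling:{root_pipeline_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


def tracing_feature(
    *,
    flag_enabled: bool,
    otel_endpoint: Optional[str],
    sampling_rate: float,
    root_pipeline_id: int,
    job_running_span_id: str,
) -> Optional[dict]:
    """Return the `features.tracing` object, or None when it must be absent."""
    # All three conditions must hold: flag AND endpoint configured AND sampled.
    if not flag_enabled or not otel_endpoint:
        return None
    if not is_sampled(root_pipeline_id, sampling_rate):
        return None
    return {
        "trace_id": derive_trace_id(root_pipeline_id),
        "span_parent_id": job_running_span_id,
        # Single entry for MVC; the optional typed `auth` block is omitted here.
        "otel_endpoints": [{"url": otel_endpoint}],
    }
```

Because the bucket depends only on the root pipeline ID, every job in a parent/child pipeline hierarchy gets the same decision, and raising the sampling rate only adds pipelines to the sampled set without dropping any that were already sampled.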