CI Job Telemetry Reporting - MVC
## Overview
This epic tracks the MVC implementation of CI Job Telemetry Reporting — the first application of a service-agnostic OTLP-based telemetry infrastructure for GitLab. Runners push OTLP spans directly to an OTEL Collector, which writes them to the `otel_traces` table in ClickHouse (auto-created by the [OTEL Collector ClickHouse exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/sqltemplates/traces_table.sql)). For MVC, DevExp queries `otel_traces` directly via Grafana — the product-facing `ci_job_telemetry_traces` Materialized View is deferred to Phase 3.
## Architecture Document
- **MR**: <https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17980>
- **Handbook Link**: <https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/>
## Parent Work Item
- <https://gitlab.com/groups/gitlab-org/quality/analytics/-/work_items/22>
## MVC Scope
1. **Runner telemetry collection**: Instrument GitLab Runner to collect timing and metadata for built-in build stages (git clone, cache, artifacts, scripts) and CI Functions
2. **OIDC/workload identity auth**: Runners on GitLab.com hosted infrastructure authenticate to the OTEL Collector using OIDC tokens or workload identity — no auth gateway needed for MVC
3. **OTEL Collector → ClickHouse**: Standard OTEL Collector with ClickHouse exporter writes OTLP spans to the `otel_traces` table (auto-created by the exporter). A [loadbalancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/loadbalancingexporter) routes spans by `traceID` to ensure trace completeness across backends.
4. **Rails job lifecycle spans**: Rails emits `job_lifecycle`, `job_pending`, and `job_running` spans covering the full job state machine (`created` → `pending` → `running` → `finished`). These spans provide end-to-end visibility, including Sidekiq/`PipelineProcessWorker` delays that are invisible to the Runner. The `job_running` span's `span_id` is set as `span_parent_id` in `features.tracing` so the Runner parents its `job_execution` span under it.
5. **Feature negotiation**: Controlled by a **project-level feature flag** for gradual rollout. The job payload includes `features.tracing` with `trace_id`, `span_parent_id` (the Rails `job_running` span ID), and `otel_endpoints` (single entry for MVC — GitLab's Collector). MVC targets DevExp Customer0 projects on hosted runners, then ramps up progressively.
6. **Sampling**: Rails-side deterministic head sampling using a global application setting (`ci_job_telemetry_sampling_rate`) combined with a deterministic hash of the root pipeline ID. All jobs in a pipeline hierarchy get the same sampling decision (per-pipeline consistency, no partial traces). The Runner uses `AlwaysOn` SDK sampling: if `features.tracing` is present, it instruments everything. The sampling rate is adjustable without Runner or Collector changes.
7. **Internal dashboards**: Enable the DevExp team (Customer0) to build Grafana dashboards by querying `otel_traces` directly (filtering by `ServiceName = 'gitlab-ci-runner'`) using [ClickHouse trace visualization](https://clickhouse.com/docs/observability/grafana#traces). The `ci_job_telemetry_traces` Materialized View is deferred to Phase 3.
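
The sampling rule in item 6 can be illustrated with a minimal sketch. This is not the Rails implementation. The hash function (SHA-256), the bucket count, the function name `is_pipeline_sampled`, and the interpretation of `ci_job_telemetry_sampling_rate` as a fraction between 0.0 and 1.0 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical bucket resolution: allows rates in steps of 0.01%.
BUCKETS = 10_000


def is_pipeline_sampled(root_pipeline_id: int, sampling_rate: float) -> bool:
    """Return a deterministic sampling decision for a whole pipeline hierarchy.

    Every job passes the ID of its *root* pipeline, so parent and child
    pipelines always land in the same bucket and share one decision.
    `sampling_rate` is assumed to be a fraction in [0.0, 1.0].
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    # Assumption: SHA-256 is used only as a stable, well-distributed hash.
    digest = hashlib.sha256(str(root_pipeline_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    # A pipeline is sampled when its bucket falls below the rate threshold.
    return bucket < round(sampling_rate * BUCKETS)
```

Because the bucket depends only on the root pipeline ID, raising the rate only adds pipelines to the sampled set. A pipeline that is sampled at a lower rate stays sampled at any higher rate, which keeps traces complete while the rate is ramped up.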
**MVC explicitly excludes** (deferred to Phase 2, Phase 3, or future work):
- `ci_job_telemetry_traces` Materialized View → Phase 3
- BYO OTLP endpoints (customer-configured OTLP destinations) → future work
- Job Router telemetry (KAS spans) → Phase 2
- Self-managed runners reporting to GitLab.com → Phase 2
- Self-Managed and Dedicated instance deployment → Phase 2
- In-product UI visualization → Phase 3
- Automated alerting → Phase 3
- Resource usage metrics → post-MVC
- CI Functions DAG telemetry → post-MVC
## Sub-Epics by Team
### MVC (Phase 1)
| Team | Epic / Work Item | Description |
|------|------------------|-------------|
| ~"group::runner core" | &20633 | Runner basic instrumentation (~1w: feature negotiation + first `job_execution` span + built-in stage spans), then CI Functions spans (~2w) |
| ~"group::ci platform" | &20945 | Feature negotiation, trace context init, application settings, Rails job lifecycle spans |
| ~"group::Observability" | ~~OTEL Collector + ClickHouse instance ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102))~~ | ✅ ClickHouse instance enabled (MR merged 2026-02-12). OTEL Collector deployment in progress. |
#### CI Platform Issues (~"group::ci platform" — &20945)
| Issue | Status | Description |
|-------|--------|-------------|
| gitlab-org/gitlab#590588+ | ✅ Closed | Feature negotiation (Rails) |
| gitlab-org/gitlab#590587+ | ✅ Closed | Trace context initialization (Rails) |
| gitlab-org/gitlab#591941+ | ✅ Closed | OTEL Collector endpoint application setting |
| gitlab-org/gitlab#593834+ | ✅ Closed | CI telemetry sampling rate application setting |
| gitlab-org/gitlab#596774+ | In dev | Rails job lifecycle spans |
| gitlab-org/gitlab#590939+ | Rolling out | `ci_job_telemetry` feature flag rollout |
#### Runner Core Issues (~"group::runner core" — &20633)
| Issue | Status | Description |
|-------|--------|-------------|
| gitlab-org/gitlab-runner#39231+ | ✅ Closed | Feature negotiation, OTLP export client (via LabKit), first `job_execution` span |
| gitlab-org/gitlab-runner#39230+ | Ready for development | Built-in build stage span instrumentation |
| gitlab-org/gitlab-runner#39271+ | Open | CI Functions span instrumentation |
### Post-MVC
| Team | Issue / Work Item | Phase | Description |
|------|-------------------|-------|-------------|
| ~"group::ci platform" / ~"group::runner core" | &21683 | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | Phase 2 epic — complete telemetry pipeline (auth gateway, Job Router telemetry, self-hosted runners) |
| ~"group::ci platform" | gitlab-org/gitlab#589219+ | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | `POST /api/v4/internal/ci/telemetry/auth` — validates runner/job tokens |
| ~"group::ci platform" | gitlab-org/gitlab#590586+ | Phase 3 ([Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users)) | `ci_job_telemetry_traces` Materialized View on production CH (deferred until query patterns are established from MVC Grafana usage) |
| ~"group::ci platform" | gitlab-org/gitlab#590589+ | Phase 3 ([Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users)) | Rails GraphQL query layer for CI telemetry traces |
| ~"group::Observability" | Self-managed/Dedicated deployment | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | OTEL Collector + ClickHouse via [Tenant Observability Stack](https://gitlab-com.gitlab.io/gl-infra/terraform-modules/observability/tenant-observability-stack/) (`k8s-monitoring-stack` Helm chart) |
## Key Design Decisions
- **Service-agnostic backend**: `otel_traces` table is created by the standard OTEL Collector ClickHouse exporter. Any GitLab component can emit traces; CI telemetry is the first consumer. For MVC, DevExp queries `otel_traces` directly via Grafana — the `ci_job_telemetry_traces` MV is deferred to Phase 3.
- **OIDC/workload identity auth (MVC)**: GitLab.com hosted runners authenticate directly using OIDC tokens. No auth gateway or proxy needed for MVC.
- **Token-based auth gateway (post-MVC)**: Self-managed runners that can't use OIDC will authenticate via a gateway that validates runner/job tokens against a Rails endpoint.
- **Project-level feature flag**: Rollout controlled by a project-level feature flag — enables gradual enablement starting with Customer0 projects.
- **Rails-side deterministic head sampling**: Sampling is handled in Rails using a global application setting (`ci_job_telemetry_sampling_rate`) combined with a deterministic hash of the root pipeline ID. This gives per-pipeline consistency (all jobs share the same sampling decision). The Runner uses `AlwaysOn` SDK sampling: if `features.tracing` is present, it instruments everything. The sampling rate is adjustable without Runner or Collector changes. Collector-side sampling will be layered on later (Stage 2).
- **Endpoint from Rails (`otel_endpoints`)**: The OTEL Collector endpoint URL is a Rails application setting, sent to runners in `features.tracing.otel_endpoints` (single entry for MVC). No static runner-side `config.toml` configuration needed — the feature works automatically based on runner version and namespace plan. BYO OTLP destinations are deferred to [future work](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#future-work-byo-otlp-endpoints).
- **`span_parent_id` references `job_running`**: The `job_running` span's `span_id` is set as `span_parent_id` in `features.tracing` so the Runner parents its `job_execution` span under `job_running` (not under `job_lifecycle`), producing the hierarchy `job_lifecycle` → `job_running` → `job_execution`.
- **Two-instance ClickHouse model (GitLab.com)**: Observability CH is for internal/operational use only (Grafana dashboards, cross-service trace correlation) — not exposed to end-users. Production CH receives a filtered/sampled subset for customer-facing features (GraphQL, GLQL, Duo) — Phase 3. The OTEL Collector uses separate exporter pipelines per instance.
- **Single Collector, separate exporter pipelines**: One OTEL Collector deployment serves all telemetry. Separate exporter pipelines write to different CH instances with independent filters, sampling, and retention. Observability team operates the shared Collector; each exporter pipeline is configured by its consuming team.
- **Three deployment tiers**: (a) GitLab.com hosted runners — turnkey, MVC; (b) self-hosted runners on GitLab.com — requires auth gateway, Phase 2; (c) self-managed/Dedicated — requires shipping OTEL Collector, Phase 2.
- **Standard OTEL pipeline**: No custom pipeline components — standard OTEL Collector with the ClickHouse exporter.
- **OTLP format**: Industry-standard OpenTelemetry Protocol with standard fields.
- **ClickStack compatibility**: The `otel_traces` schema is [functionally compatible with ClickStack](https://clickhouse.com/docs/use-cases/observability/clickstack/ingesting-data/schemas#traces), making future migration to the full ClickStack platform straightforward.
- **Tenant Observability Stack for SM**: Self-managed/Dedicated deployment (Phase 2, Beyond GitLab.com workstream) will leverage the Observability team's [Tenant Observability Stack](https://gitlab-com.gitlab.io/gl-infra/terraform-modules/observability/tenant-observability-stack/) rather than shipping observability as part of GitLab itself. In-product features on self-managed require shipping an OTEL Collector.
- **Observability team collaboration**: The Observability team (~"group::Observability") provides the OTEL Collector and ClickHouse instance. The architecture is designed to converge with their distributed tracing infrastructure.
- **Graceful degradation**: Telemetry failures never affect job outcome.
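
The `features.tracing` handoff and the resulting `job_lifecycle` → `job_running` → `job_execution` hierarchy can be sketched as follows. This is an illustrative model only, not Rails or Runner code. The `Span` class, the helper names, the shape of `otel_endpoints` as a list of URL strings, and the example endpoint URL are assumptions. The ID sizes (16-byte trace ID, 8-byte span ID) follow the OpenTelemetry trace format. The `job_pending` span is omitted for brevity.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    name: str
    trace_id: str                       # 16 bytes, hex-encoded (32 chars)
    span_id: str                        # 8 bytes, hex-encoded (16 chars)
    parent_span_id: Optional[str] = None


def new_span(name: str, trace_id: str, parent: Optional[Span] = None) -> Span:
    """Create a span in the given trace, optionally parented under `parent`."""
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


def build_tracing_features(job_running: Span, collector_endpoint: str) -> dict:
    """Rails side (sketch): expose trace context to the Runner in the job payload."""
    return {
        "trace_id": job_running.trace_id,
        "span_parent_id": job_running.span_id,   # references job_running, not job_lifecycle
        "otel_endpoints": [collector_endpoint],  # single entry for MVC (shape assumed)
    }


def runner_job_execution_span(features: dict) -> Optional[Span]:
    """Runner side (sketch): instrument only when `features.tracing` is present."""
    tracing = features.get("tracing")
    if tracing is None:
        return None  # flag disabled or pipeline not sampled: emit nothing
    return Span(
        "job_execution",
        tracing["trace_id"],
        secrets.token_hex(8),
        tracing["span_parent_id"],
    )
```

A usage example under the same assumptions:

```python
trace_id = secrets.token_hex(16)
job_lifecycle = new_span("job_lifecycle", trace_id)
job_running = new_span("job_running", trace_id, parent=job_lifecycle)

# Hypothetical Collector URL, for illustration only.
features = {"tracing": build_tracing_features(job_running, "https://otel-collector.example.com")}
job_execution = runner_job_execution_span(features)
```

Here `job_execution.parent_span_id` equals `job_running.span_id`, and all three spans share one `trace_id`, so the Runner's spans nest under the Rails lifecycle spans in a single trace.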
## Staffing & Ownership
| Area | Owner | Team |
|------|-------|------|
| Feature negotiation, trace context, Rails lifecycle spans, application settings | @pedropombeiro / @narendran-kannan | ~"group::ci platform" |
| Runner OTLP span emission, LabKit OTEL integration | @ash2k | ~"group::runner core" |
| OTEL Collector + ClickHouse pipeline | @nduff / @e_forbes | ~"group::Observability" |
## Rollout Strategy
Tracked in gitlab-org/gitlab#590939+ (`ci_job_telemetry` feature flag rollout).
1. **Customer0 scope**: Enable for DevExp Customer0 projects on GitLab.com hosted runners. Validate end-to-end pipeline (Runner → OTEL Collector → ClickHouse → Grafana).
2. **Expand to broader GitLab.com projects**: Gradually increase FF percentage and `ci_job_telemetry_sampling_rate`, monitoring storage growth and Collector resource usage.
3. **GA / default-on**: Remove FF once pipeline is proven at scale.
Sampling follows a [phased approach](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#sampling-strategy) — Rails-side deterministic head sampling (MVC) → Collector-side as additional layer (Stage 2) → tail sampling for failed/slow jobs.
## Operational Ownership & On-call
| Component | Operational owner | Notes |
|-----------|-------------------|-------|
| Feature flag, application settings, Rails endpoints, lifecycle spans | ~"group::ci platform" | Standard CI Platform on-call rotation |
| OTEL Collector, ClickHouse instance | ~"group::Observability" | Managed by Observability team's tenant stack |
| Runner OTEL SDK integration | ~"group::runner core" | Runner team owns runner-side instrumentation |
The OTEL Collector and ClickHouse instance are part of the Observability team's managed infrastructure — CI Platform does not take on-call responsibility for the telemetry pipeline itself, only for the Rails-side feature negotiation, application settings, and lifecycle spans.
## Monitoring & Success Metrics
### Monitoring
- **OTEL Collector health**: Collector metrics (spans received/exported, queue depth, error rates) — owned by Observability team
- **Feature flag adoption**: Number of projects/jobs with `ci_job_telemetry` enabled
- **Sampling rate effectiveness**: Ratio of sampled vs total pipelines, span volume per Customer0 project
### Success metrics (MVC)
| Metric | Target |
|--------|--------|
| End-to-end trace visibility | Traces from Customer0 CI jobs visible in Grafana within 5 minutes of job completion |
| Ingestion reliability | <0.1% span drop rate at the OTEL Collector |
| Query performance | p95 query latency <5s for single-job trace lookup in Grafana |
| Zero user-facing impact | No measurable increase in `/api/v4/jobs/request` latency or Runner job pickup time |
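
The "single-job trace lookup" in the query performance target could look roughly like the sketch below, which builds a parameterized ClickHouse query against `otel_traces`. The column names (`Timestamp`, `TraceId`, `SpanId`, `ParentSpanId`, `SpanName`, `ServiceName`, `Duration`) are taken from the OTEL Collector ClickHouse exporter's default traces schema and should be checked against the deployed exporter version. The function name and the trace ID validation are hypothetical.

```python
import re
from typing import Dict, Tuple

# OTLP trace IDs are 16 bytes, rendered here as 32 lowercase hex characters.
_TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def single_trace_query(
    trace_id: str, service_name: str = "gitlab-ci-runner"
) -> Tuple[str, Dict[str, str]]:
    """Build a query returning the spans of one trace, oldest first.

    Returns the SQL text and a parameter mapping. The `{name:Type}`
    placeholders use ClickHouse's server-side query parameter syntax;
    how they are bound depends on the client in use.
    """
    if not _TRACE_ID_RE.fullmatch(trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    sql = (
        "SELECT Timestamp, SpanId, ParentSpanId, SpanName, Duration "
        "FROM otel_traces "
        "WHERE ServiceName = {service_name:String} "
        "AND TraceId = {trace_id:String} "
        "ORDER BY Timestamp"
    )
    return sql, {"service_name": service_name, "trace_id": trace_id}
```

Filtering on `ServiceName` matches the dashboard filter described in the MVC scope. If the Rails lifecycle spans are emitted under a different service name, which is likely but not confirmed here, the filter would need to be relaxed to see the full `job_lifecycle` hierarchy.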
## Open Questions
### ClickHouse instance access from Rails
For the future query layer (gitlab-org/gitlab#590589, Phase 3) and the Pipeline Optimization Agent, Rails will need to read trace data. The Observability CH instance is for internal use only and is not exposed to end-users. Customer-facing queries hit the **production CH instance**, which receives a filtered/sampled subset via the Collector's production-CH exporter pipeline. The product-facing `ci_job_telemetry_traces` MV (gitlab-org/gitlab#590586) is also deferred to Phase 3, on the production CH instance.
This is now a [design decision](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#design-decisions) rather than an open question — the OTEL Collector fan-out approach has been confirmed.
## Timeline
- **ClickHouse instance**: ✅ Observability team's instance ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102) — merged 2026-02-12)
- **CI Platform (MVC)**: Feature negotiation, trace context, and application settings done (Feb–Apr 2026); Rails lifecycle spans in development (#596774). Issues tracked under &20945.
- **Runner Core (basic instrumentation)**: Basic instrumentation merged (#39231, Apr 2026); built-in stage spans (#39230) are ready for development and CI Functions spans (#39271) are open. Issues tracked under &20633.