CI Job Telemetry Reporting - MVC

## Overview

This epic tracks the MVC implementation of CI Job Telemetry Reporting, the first application of a service-agnostic OTLP-based telemetry infrastructure for GitLab. Runners push OTLP spans directly to an OTEL Collector, which writes them to the `otel_traces` table in ClickHouse (auto-created by the [OTEL Collector ClickHouse exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/sqltemplates/traces_table.sql)). For MVC, DevExp queries `otel_traces` directly via Grafana; the product-facing `ci_job_telemetry_traces` Materialized View is deferred to Phase 3.

## Architecture Document

- **MR**: <https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17980>
- **Handbook Link**: <https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/>

## Parent Work Item

- <https://gitlab.com/groups/gitlab-org/quality/analytics/-/work_items/22>

## MVC Scope

1. **Runner telemetry collection**: Instrument GitLab Runner to collect timing and metadata for built-in build stages (git clone, cache, artifacts, scripts) and CI Functions.
2. **OIDC/workload identity auth**: Runners on GitLab.com hosted infrastructure authenticate to the OTEL Collector using OIDC tokens or workload identity. No auth gateway is needed for MVC.
3. **OTEL Collector → ClickHouse**: Standard OTEL Collector with ClickHouse exporter writes OTLP spans to the `otel_traces` table (auto-created by the exporter). A [loadbalancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/loadbalancingexporter) routes spans by `traceID` to ensure trace completeness across backends.
4. **Rails job lifecycle spans**: Rails emits `job_lifecycle`, `job_pending`, and `job_running` spans covering the full job state machine (`created` → `pending` → `running` → `finished`).
   This provides end-to-end visibility, including Sidekiq/`PipelineProcessWorker` delays that are invisible to the Runner. The `job_running` span's `span_id` is set as `span_parent_id` in `features.tracing` so the Runner parents its `job_execution` span under it.
5. **Feature negotiation**: Controlled by a **project-level feature flag** for gradual rollout. The job payload includes `features.tracing` with `trace_id`, `span_parent_id` (the Rails `job_running` span ID), and `otel_endpoints` (a single entry for MVC: GitLab's Collector). MVC targets DevExp Customer0 projects on hosted runners, then ramps up progressively.
6. **Sampling**: Rails-side deterministic head sampling using a global application setting (`ci_job_telemetry_sampling_rate`) combined with a deterministic hash of the root pipeline ID. All jobs in a pipeline hierarchy get the same sampling decision (per-pipeline consistency, no partial traces). The Runner uses `AlwaysOn` SDK sampling: if `features.tracing` is present, it instruments everything. The sampling rate is adjustable without Runner or Collector changes.
7. **Internal dashboards**: Enable the DevExp team (Customer0) to build Grafana dashboards by querying `otel_traces` directly (filtering by `ServiceName = 'gitlab-ci-runner'`) using [ClickHouse trace visualization](https://clickhouse.com/docs/observability/grafana#traces). The `ci_job_telemetry_traces` Materialized View is deferred to Phase 3.

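
The sampling and feature-negotiation items in the scope list can be sketched roughly as follows. This is a minimal illustration, not the Rails implementation: the function names, the choice of SHA-256, the way the hash is mapped to a bucket, and the representation of `otel_endpoints` as a list of URL strings are all assumptions made for the example. Only the field names `trace_id`, `span_parent_id`, and `otel_endpoints` and the idea of hashing the root pipeline ID against `ci_job_telemetry_sampling_rate` come from the design above.

```python
import hashlib
from typing import Optional


def is_pipeline_sampled(root_pipeline_id: int, sampling_rate: float) -> bool:
    """Deterministic head-sampling decision keyed on the root pipeline ID.

    Hypothetical helper: every job in the same pipeline hierarchy shares the
    root pipeline ID, so every job gets the same decision (no partial traces).
    """
    if sampling_rate <= 0.0:
        return False
    if sampling_rate >= 1.0:
        return True
    # Assumption: SHA-256 of the decimal ID, first 8 bytes mapped to [0, 1).
    digest = hashlib.sha256(str(root_pipeline_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


def build_tracing_features(
    root_pipeline_id: int,
    sampling_rate: float,
    trace_id: str,
    job_running_span_id: str,
    collector_endpoint: str,
) -> Optional[dict]:
    """Return the `features.tracing` payload, or None when not sampled.

    Hypothetical helper: the Runner is described as instrumenting everything
    whenever `features.tracing` is present, so omitting the payload is what
    turns tracing off for an unsampled pipeline.
    """
    if not is_pipeline_sampled(root_pipeline_id, sampling_rate):
        return None
    return {
        "trace_id": trace_id,
        # The Runner parents its `job_execution` span under `job_running`.
        "span_parent_id": job_running_span_id,
        # Single entry for MVC; the entry shape here is an assumption.
        "otel_endpoints": [collector_endpoint],
    }


if __name__ == "__main__":
    # Illustrative values only; the endpoint URL is a placeholder.
    payload = build_tracing_features(
        root_pipeline_id=1001,
        sampling_rate=0.25,
        trace_id="0af7651916cd43dd8448eb211c80319c",
        job_running_span_id="b7ad6b7169203331",
        collector_endpoint="https://otel-collector.example.com:4318",
    )
    print(payload)
```

Because the bucket depends only on the root pipeline ID, raising the rate only ever adds pipelines to the sampled set: a pipeline sampled at a lower rate stays sampled at any higher rate.
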

**MVC explicitly excludes** (deferred to Phase 2, Phase 3, or future work):

- `ci_job_telemetry_traces` Materialized View → Phase 3
- BYO OTLP endpoints (customer-configured OTLP destinations) → future work
- Job Router telemetry (KAS spans) → Phase 2
- Self-managed runners reporting to GitLab.com → Phase 2
- Self-Managed and Dedicated instance deployment → Phase 2
- In-product UI visualization → Phase 3
- Automated alerting → Phase 3
- Resource usage metrics → post-MVC
- CI Functions DAG telemetry → post-MVC

## Sub-Epics by Team

### MVC (Phase 1)

| Team | Epic / Work Item | Description |
|------|------------------|-------------|
| ~"group::runner core" | &20633 | Runner basic instrumentation (~1w: feature negotiation + first `job_execution` span + built-in stage spans), then CI Functions spans (~2w) |
| ~"group::ci platform" | &20945 | Feature negotiation, trace context init, application settings, Rails job lifecycle spans |
| ~"group::Observability" | ~~OTEL Collector + ClickHouse instance ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102))~~ | ✅ ClickHouse instance enabled (MR merged 2026-02-12). OTEL Collector deployment in progress. |

#### CI Platform Issues (~"group::ci platform", &20945)

| Issue | Status | Description |
|-------|--------|-------------|
| gitlab-org/gitlab#590588+ | ✅ Closed | Feature negotiation (Rails) |
| gitlab-org/gitlab#590587+ | ✅ Closed | Trace context initialization (Rails) |
| gitlab-org/gitlab#591941+ | ✅ Closed | OTEL Collector endpoint application setting |
| gitlab-org/gitlab#593834+ | ✅ Closed | CI telemetry sampling rate application setting |
| gitlab-org/gitlab#596774+ | In dev | Rails job lifecycle spans |
| gitlab-org/gitlab#590939+ | Rolling out | `ci_job_telemetry` feature flag rollout |

#### Runner Core Issues (~"group::runner core", &20633)

| Issue | Status | Description |
|-------|--------|-------------|
| gitlab-org/gitlab-runner#39231+ | ✅ Closed | Feature negotiation, OTLP export client (via LabKit), first `job_execution` span |
| gitlab-org/gitlab-runner#39230+ | Ready for development | Built-in build stage span instrumentation |
| gitlab-org/gitlab-runner#39271+ | Open | CI Functions span instrumentation |

### Post-MVC

| Team | Issue / Work Item | Phase | Description |
|------|-------------------|-------|-------------|
| ~"group::ci platform" / ~"group::runner core" | &21683 | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | Phase 2 epic: complete telemetry pipeline (auth gateway, Job Router telemetry, self-hosted runners) |
| ~"group::ci platform" | gitlab-org/gitlab#589219+ | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | `POST /api/v4/internal/ci/telemetry/auth`: validates runner/job tokens |
| ~"group::ci platform" | gitlab-org/gitlab#590586+ | Phase 3 ([Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users)) | `ci_job_telemetry_traces` Materialized View on production CH (deferred until query patterns are established from MVC Grafana usage) |
| ~"group::ci platform" | gitlab-org/gitlab#590589+ | Phase 3 ([Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users)) | Rails GraphQL query layer for CI telemetry traces |
| ~"group::Observability" | Self-managed/Dedicated deployment | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | OTEL Collector + ClickHouse via [Tenant Observability Stack](https://gitlab-com.gitlab.io/gl-infra/terraform-modules/observability/tenant-observability-stack/) (`k8s-monitoring-stack` Helm chart) |

## Key Design Decisions

- **Service-agnostic backend**: `otel_traces` table is created by the standard OTEL Collector ClickHouse exporter. Any GitLab component can emit traces; CI telemetry is the first consumer. For MVC, DevExp queries `otel_traces` directly via Grafana; the `ci_job_telemetry_traces` MV is deferred to Phase 3.
- **OIDC/workload identity auth (MVC)**: GitLab.com hosted runners authenticate directly using OIDC tokens. No auth gateway or proxy needed for MVC.
- **Token-based auth gateway (post-MVC)**: Self-managed runners that can't use OIDC will authenticate via a gateway that validates runner/job tokens against a Rails endpoint.
- **Project-level feature flag**: Rollout is controlled by a project-level feature flag, which allows gradual enablement starting with Customer0 projects.
- **Rails-side deterministic head sampling**: Sampling is handled at Rails using a global application setting (`ci_job_telemetry_sampling_rate`) combined with a deterministic hash of the root pipeline ID. This gives per-pipeline consistency (all jobs share the same sampling decision). The Runner uses `AlwaysOn` SDK sampling: if `features.tracing` is present, it instruments everything. The sampling rate is adjustable without Runner or Collector changes. Collector-side sampling will be layered on later (Stage 2).
- **Endpoint from Rails (`otel_endpoints`)**: The OTEL Collector endpoint URL is a Rails application setting, sent to runners in `features.tracing.otel_endpoints` (single entry for MVC). No static runner-side `config.toml` configuration is needed; the feature works automatically based on runner version and namespace plan. BYO OTLP destinations are deferred to [future work](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#future-work-byo-otlp-endpoints).
- **`span_parent_id` references `job_running`**: The `job_running` span's `span_id` is set as `span_parent_id` in `features.tracing` so the Runner parents its `job_execution` span under `job_running` (not under `job_lifecycle`), producing the hierarchy `job_lifecycle` → `job_running` → `job_execution`.
- **Two-instance ClickHouse model (GitLab.com)**: Observability CH is for internal/operational use only (Grafana dashboards, cross-service trace correlation) and is not exposed to end-users. Production CH receives a filtered/sampled subset for customer-facing features (GraphQL, GLQL, Duo) in Phase 3. The OTEL Collector uses separate exporter pipelines per instance.
- **Single Collector, separate exporter pipelines**: One OTEL Collector deployment serves all telemetry. Separate exporter pipelines write to different CH instances with independent filters, sampling, and retention. The Observability team operates the shared Collector; each exporter pipeline is configured by its consuming team.
- **Three deployment tiers**: (a) GitLab.com hosted runners: turnkey, MVC; (b) self-hosted runners on GitLab.com: requires auth gateway, Phase 2; (c) self-managed/Dedicated: requires shipping OTEL Collector, Phase 2.
- **Standard OTEL pipeline**: No custom pipeline components, only the standard OTEL Collector with the ClickHouse exporter.
- **OTLP format**: Industry-standard OpenTelemetry Protocol with standard fields.
- **ClickStack compatibility**: The `otel_traces` schema is [functionally compatible with ClickStack](https://clickhouse.com/docs/use-cases/observability/clickstack/ingesting-data/schemas#traces), making future migration to the full ClickStack platform straightforward.
- **Tenant Observability Stack for SM**: Self-managed/Dedicated deployment (Phase 2, Beyond GitLab.com workstream) will leverage the Observability team's [Tenant Observability Stack](https://gitlab-com.gitlab.io/gl-infra/terraform-modules/observability/tenant-observability-stack/) rather than shipping observability as part of GitLab itself. In-product features on self-managed require shipping an OTEL Collector.
- **Observability team collaboration**: The Observability team (~"group::Observability") provides the OTEL Collector and ClickHouse instance. Architecture designed to converge with their distributed tracing infrastructure.
- **Graceful degradation**: Telemetry failures never affect job outcome.

## Staffing & Ownership

| Area | Owner | Team |
|------|-------|------|
| Feature negotiation, trace context, Rails lifecycle spans, application settings | @pedropombeiro / @narendran-kannan | ~"group::ci platform" |
| Runner OTLP span emission, LabKit OTEL integration | @ash2k | ~"group::runner core" |
| OTEL Collector + ClickHouse pipeline | @nduff / @e_forbes | ~"group::Observability" |

## Rollout Strategy

Tracked in gitlab-org/gitlab#590939+ (`ci_job_telemetry` feature flag rollout).

1. **Customer0 scope**: Enable for DevExp Customer0 projects on GitLab.com hosted runners.
   Validate the end-to-end pipeline (Runner → OTEL Collector → ClickHouse → Grafana).
2. **Expand to broader GitLab.com projects**: Gradually increase FF percentage and `ci_job_telemetry_sampling_rate`, monitoring storage growth and Collector resource usage.
3. **GA / default-on**: Remove FF once pipeline is proven at scale.

Sampling follows a [phased approach](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#sampling-strategy): Rails-side deterministic head sampling (MVC) → Collector-side as additional layer (Stage 2) → tail sampling for failed/slow jobs.

## Operational Ownership & On-call

| Component | Operational owner | Notes |
|-----------|-------------------|-------|
| Feature flag, application settings, Rails endpoints, lifecycle spans | ~"group::ci platform" | Standard CI Platform on-call rotation |
| OTEL Collector, ClickHouse instance | ~"group::Observability" | Managed by Observability team's tenant stack |
| Runner OTEL SDK integration | ~"group::runner core" | Runner team owns runner-side instrumentation |

The OTEL Collector and ClickHouse instance are part of the Observability team's managed infrastructure. CI Platform does not take on-call responsibility for the telemetry pipeline itself, only for the Rails-side feature negotiation, application settings, and lifecycle spans.

## Monitoring & Success Metrics

### Monitoring

- **OTEL Collector health**: Collector metrics (spans received/exported, queue depth, error rates), owned by the Observability team
- **Feature flag adoption**: Number of projects/jobs with `ci_job_telemetry` enabled
- **Sampling rate effectiveness**: Ratio of sampled vs total pipelines, span volume per Customer0 project

### Success metrics (MVC)

| Metric | Target |
|--------|--------|
| End-to-end trace visibility | Traces from Customer0 CI jobs visible in Grafana within 5 minutes of job completion |
| Ingestion reliability | <0.1% span drop rate at the OTEL Collector |
| Query performance | p95 query latency <5s for single-job trace lookup in Grafana |
| Zero user-facing impact | No measurable increase in `/api/v4/jobs/request` latency or Runner job pickup time |

## Open Questions

### ClickHouse instance access from Rails

For the future query layer (gitlab-org/gitlab#590589, Phase 3) and the Pipeline Optimization Agent, Rails will need to read trace data. The Observability CH instance is for internal use only and is not exposed to end-users. Customer-facing queries hit the **production CH instance**, which receives a filtered/sampled subset via the Collector's production-CH exporter pipeline. The product-facing `ci_job_telemetry_traces` MV (gitlab-org/gitlab#590586) is also deferred to Phase 3, on the production CH instance. This is now a [design decision](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#design-decisions) rather than an open question: the OTEL Collector fan-out approach has been confirmed.

## Timeline

- **ClickHouse instance**: ✅ Observability team's instance ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102), merged 2026-02-12)
- **CI Platform (MVC)**: Feature negotiation, trace context, and application settings done (Feb–Apr 2026); Rails lifecycle spans in development (#596774).
  Issues tracked under &20945.
- **Runner Core (basic instrumentation)**: Basic instrumentation merged (#39231, Apr 2026); built-in stage spans (#39230) and CI Functions spans (#39271) ready/pending. Issues tracked under &20633.