# CI Job Telemetry Reporting - MVC
## Overview

This epic tracks the MVC implementation of CI Job Telemetry Reporting — the first application of a service-agnostic OTLP-based telemetry infrastructure for GitLab. Runners push OTLP spans directly to an OTEL Collector, which writes them to the `otel_traces` table in ClickHouse (auto-created by the [OTEL Collector ClickHouse exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/sqltemplates/traces_table.sql)). A Materialized View populates a `ci_job_telemetry_traces` table from `otel_traces`, filtered by `ServiceName`.

## Architecture Document

- **MR**: <https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17980>
- **Live preview**: <https://gitlab-com.gitlab.io/content-sites/handbook/mr17980/handbook/engineering/architecture/design-documents/ci_job_telemetry/>

## Parent Work Item

- <https://gitlab.com/groups/gitlab-org/quality/analytics/-/work_items/22>

## MVC Scope

1. **Runner telemetry collection**: Instrument GitLab Runner to collect timing and metadata for built-in build stages (git clone, cache, artifacts, scripts) and for CI Functions.
2. **OIDC/workload identity auth**: Runners on GitLab.com hosted infrastructure authenticate to the OTEL Collector using OIDC tokens or workload identity — no auth gateway is needed for the MVC.
3. **OTEL Collector → ClickHouse**: A standard OTEL Collector with the ClickHouse exporter writes OTLP spans to the `otel_traces` table (auto-created by the exporter). A [loadbalancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/loadbalancingexporter) routes spans by `traceID` to ensure trace completeness across backends.
4. **CI telemetry Materialized View**: A `ci_job_telemetry_traces` MV with denormalized CI-specific columns, populated from `otel_traces` filtered by `ServiceName`.
5. **Feature negotiation**: Controlled by a **project-level feature flag** for gradual rollout.
   The job payload communicates only enablement (yes/no) — no sampling configuration is passed per-job. MVC targets DevExp Customer0 projects on hosted runners, then ramps up progressively.
6. **Sampling**: Configured entirely at the OTEL Collector level using the [probabilistic sampler processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/probabilisticsamplerprocessor). Start with a conservative rate (~10%) to validate the pipeline end-to-end, then ramp up toward 100% for the Customer0 scope. Adjustable without any Rails or Runner changes.
7. **Internal dashboards**: Enable the DevExp team (Customer 0) to build Grafana dashboards using [ClickHouse trace visualization](https://clickhouse.com/docs/observability/grafana#traces).

## Sub-Epics by Team

### MVC (Phase 1)

| Team | Epic / Work Item | Description |
|------|------------------|-------------|
| ~"group::runner core" | &20633 | Runner basic instrumentation (~1w: feature negotiation + first `job_execution` span + built-in stage spans), then CI Functions spans (~2w) |
| ~"group::ci platform" | &20945 | Feature negotiation, trace context init, MV, Rails query layer |
| ~"group::Observability" | ~~OTEL Collector + ClickHouse instance ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102))~~ | ✅ ClickHouse instance enabled (MR merged 2026-02-12). OTEL Collector deployment in progress. |
#### CI Platform Issues (~"group::ci platform" — &20945)

| Issue | Scope |
|-------|-------|
| gitlab-org/gitlab#590588+ | MVC |
| gitlab-org/gitlab#590587+ | MVC |
| gitlab-org/gitlab#590586+ | MVC |

#### Runner Core Issues (~"group::runner core" — &20633)

| Issue | Milestone | Description |
|-------|-----------|-------------|
| gitlab-org/gitlab-runner#39231+ | Basic instrumentation | Feature negotiation, OTLP export client (via LabKit), first `job_execution` span |
| gitlab-org/gitlab-runner#39230+ | Basic instrumentation | Built-in build stage span instrumentation |
| gitlab-org/gitlab-runner#39271+ | CI Functions spans | CI Functions span instrumentation |

### Post-MVC

| Team | Issue / Work Item | Phase | Description |
|------|-------------------|-------|-------------|
| ~"group::runner core" / TBD | Token-based auth gateway | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | Auth gateway for self-managed runners that can't use OIDC |
| ~"group::ci platform" | gitlab-org/gitlab#589219+ | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | `POST /api/v4/internal/ci/telemetry/auth` — validates runner/job tokens |
| ~"group::ci platform" | gitlab-org/gitlab#590589+ | Phase 3 ([Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users)) | Rails GraphQL query layer for CI telemetry traces |
| ~"group::Observability" | Self-managed/Dedicated deployment | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | OTEL Collector + ClickHouse via [Tenant Observability Stack](https://gitlab-com.gitlab.io/gl-infra/terraform-modules/observability/tenant-observability-stack/) (`k8s-monitoring-stack` Helm chart) |

## Key Design Decisions

- **Service-agnostic backend**: The `otel_traces` table is created by the standard OTEL Collector ClickHouse exporter. Any GitLab component can emit traces; CI telemetry is the first consumer via a Materialized View.
- **OIDC/workload identity auth (MVC)**: GitLab.com hosted runners authenticate directly using OIDC tokens. No auth gateway or proxy is needed for the MVC.
- **Token-based auth gateway (post-MVC)**: Self-managed runners that can't use OIDC will authenticate via a gateway that validates runner/job tokens against a Rails endpoint.
- **Project-level feature flag**: Rollout is controlled by a project-level feature flag, allowing progressive enablement starting with Customer0 projects.
- **OTEL Collector-side sampling**: The sampling rate is configured entirely at the OTEL Collector using the probabilistic sampler processor — not communicated per-job. Start with ~10% and ramp up. Adjustable without Rails or Runner changes.
- **Single endpoint with trace-aware routing**: All components (Rails, Workhorse, Runner) push to a single well-known OTEL Collector endpoint. A [loadbalancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/loadbalancingexporter) routes spans by `traceID`, ensuring all spans for a trace converge on the same backend.
- **Grafana visualization**: The MVC uses ClickHouse as a Grafana datasource for [trace visualization](https://clickhouse.com/docs/observability/grafana#traces). The Observability team is setting up the same approach for their own services (see [mimir tracing](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/work_items/4416)).
- **Standard OTEL pipeline**: No custom pipeline components — a standard OTEL Collector with the ClickHouse exporter.
- **OTLP format**: Industry-standard OpenTelemetry Protocol with standard fields.
- **ClickStack compatibility**: The `otel_traces` schema is [functionally compatible with ClickStack](https://clickhouse.com/docs/use-cases/observability/clickstack/ingesting-data/schemas#traces), making a future migration to the full ClickStack platform straightforward.
- **Tenant Observability Stack for SM**: Self-managed/Dedicated deployment (Phase 2, Beyond GitLab.com workstream) will leverage the Observability team's [Tenant Observability Stack](https://gitlab-com.gitlab.io/gl-infra/terraform-modules/observability/tenant-observability-stack/) rather than shipping observability as part of GitLab itself. This avoids licensing concerns (for example, Grafana).
- **Observability team collaboration**: The Observability team (~"group::Observability") provides the OTEL Collector and ClickHouse instance. The architecture is designed to converge with their distributed tracing infrastructure.
- **Graceful degradation**: Telemetry failures never affect job outcome.

## Staffing & Ownership

| Area | Owner | Team |
|------|-------|------|
| Feature negotiation, trace context, MV, query layer | @pedropombeiro | ~"group::ci platform" |
| Runner OTLP span emission, LabKit OTEL integration | @ash2k | ~"group::runner core" |
| OTEL Collector + ClickHouse pipeline | @nduff / @e_forbes | ~"group::Observability" |

## Rollout Strategy

Tracked in gitlab-org/gitlab#590939+ (`ci_job_telemetry` feature flag rollout).

1. **Customer0 scope**: Enable for DevExp Customer0 projects on GitLab.com hosted runners. Validate the end-to-end pipeline (Runner → OTEL Collector → ClickHouse → MV).
2. **Expand to broader GitLab.com projects**: Gradually increase the FF percentage, monitoring storage growth and Collector resource usage.
3. **GA / default-on**: Remove the FF once the pipeline is proven at scale.
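Because sampling lives entirely in the Collector, ramping the rate during rollout requires only a config change. A minimal sketch of such a Collector pipeline, assuming the contrib distribution with the probabilistic sampler and ClickHouse exporter (endpoint values are illustrative, not the real deployment):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Start at ~10% to validate the pipeline, then ramp toward 100%
  # for the Customer0 scope (scope item 6).
  probabilistic_sampler:
    sampling_percentage: 10

exporters:
  # Writes spans to the auto-created otel_traces table.
  clickhouse:
    endpoint: tcp://clickhouse.internal:9000

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [clickhouse]
```

A front tier running the loadbalancing exporter (with `routing_key: traceID`) would sit ahead of this pipeline so that all spans of a trace converge on the same backend, as described in the design decisions.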
Sampling follows a [phased approach](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#sampling-strategy) — head sampling (MVC) → collector-based probabilistic → tail sampling for failed/slow jobs.

## Operational Ownership & On-call

| Component | Operational owner | Notes |
|-----------|-------------------|-------|
| Feature flag, Rails endpoints, MV schema | ~"group::ci platform" | Standard CI Platform on-call rotation |
| OTEL Collector, ClickHouse instance | ~"group::Observability" | Managed by the Observability team's tenant stack |
| Runner OTEL SDK integration | ~"group::runner core" | Runner team owns runner-side instrumentation |

The OTEL Collector and ClickHouse instance are part of the Observability team's managed infrastructure — CI Platform does not take on-call responsibility for the telemetry pipeline itself, only for the Rails-side feature negotiation and the ClickHouse Materialized View schema.

## Monitoring & Success Metrics

### Monitoring

- **OTEL Collector health**: Collector metrics (spans received/exported, queue depth, error rates) — owned by the Observability team
- **ClickHouse MV ingestion**: Row counts and lag in `ci_job_telemetry_traces` vs `otel_traces`
- **Feature flag adoption**: Number of projects/jobs with `ci_job_telemetry` enabled

### Success metrics (MVC)

| Metric | Target |
|--------|--------|
| End-to-end trace visibility | Traces from Customer0 CI jobs visible in Grafana within 5 minutes of job completion |
| Ingestion reliability | <0.1% span drop rate at the OTEL Collector |
| Query performance | p95 query latency <5s for single-job trace lookup |
| Zero user-facing impact | No measurable increase in `/api/v4/jobs/request` latency or Runner job pickup time |

## Open Questions

### ClickHouse instance access from Rails

For the query layer (gitlab-org/gitlab#590589) and the future Pipeline Optimization Agent, Rails needs to read trace data.
Traces are ingested into the [Observability team's ClickHouse instance](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/merge_requests/102), which is separate from the main production ClickHouse instance that Rails queries via `ClickHouse::Client`. Options under discussion:

1. **Rails connects to the Observability CH instance** — single source of truth, but a new infra dependency
2. **Replicate data to the main CH instance** — fits existing Rails patterns, but adds duplication/lag
3. **API/service layer** — clean separation, but an extra hop and maintenance burden

Pending [discussion with the Observability team](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17980#note_3112470636).

## Timeline

- **ClickHouse instance**: ✅ Observability team's instance ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102) — merged 2026-02-12)
- **CI Platform (MVC)**: ~2-2.5 weeks — issues under &20945 (https://gitlab.com/gitlab-org/gitlab/-/work_items/590588 and https://gitlab.com/gitlab-org/gitlab/-/work_items/590587 must land before Runner basic instrumentation)
- **Runner Core (basic instrumentation)**: ~1 week (per @ash2k) — issues under the &20633 epic
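For illustration, the `ci_job_telemetry_traces` Materialized View from the MVC scope could be sketched along the following lines. The source columns follow the ClickHouse exporter's `otel_traces` schema; the target columns, the `gitlab.*` span attribute keys, and the `ServiceName` value are assumptions for this sketch, not the agreed schema (which is owned by the CI Platform issues above):

```sql
-- Sketch only: denormalized CI-specific target table.
-- Column set and ORDER BY key are illustrative assumptions.
CREATE TABLE ci_job_telemetry_traces
(
    `TraceId`   String,
    `SpanId`    String,
    `SpanName`  LowCardinality(String),
    `Timestamp` DateTime64(9),
    `Duration`  UInt64,
    `ProjectId` UInt64,
    `JobId`     UInt64
)
ENGINE = MergeTree
ORDER BY (ProjectId, JobId, Timestamp);

-- Populated from otel_traces, filtered by ServiceName as described
-- in the Overview. The attribute keys are hypothetical.
CREATE MATERIALIZED VIEW ci_job_telemetry_traces_mv
TO ci_job_telemetry_traces AS
SELECT
    TraceId,
    SpanId,
    SpanName,
    Timestamp,
    Duration,
    toUInt64OrZero(SpanAttributes['gitlab.project_id']) AS ProjectId,
    toUInt64OrZero(SpanAttributes['gitlab.job_id'])     AS JobId
FROM otel_traces
WHERE ServiceName = 'gitlab-runner';
```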