# CI Job Telemetry Reporting - MVC
## Overview
This epic tracks the MVC implementation of CI Job Telemetry Reporting — the first application of a service-agnostic OTLP-based telemetry infrastructure for GitLab. Runners push OTLP spans directly to an OTEL Collector, which writes them to the `otel_traces` table in ClickHouse (auto-created by the [OTEL Collector ClickHouse exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/clickhouseexporter/internal/sqltemplates/traces_table.sql)). A Materialized View populates a `ci_job_telemetry_traces` table from `otel_traces`, filtered by `ServiceName`.
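As a rough sketch of that Materialized View (illustrative only: the target columns, the `gitlab.project_id`/`gitlab.job_id` span attributes, and the `gitlab-runner` service name are assumptions, not the final design; `otel_traces` column names follow the exporter's table template and may vary by exporter version):

```sql
-- Hypothetical target table; real column set is TBD.
CREATE TABLE ci_job_telemetry_traces
(
    Timestamp DateTime64(9),
    TraceId   String,
    SpanId    String,
    SpanName  LowCardinality(String),
    Duration  UInt64,
    ProjectId UInt64,
    JobId     UInt64
)
ENGINE = MergeTree
ORDER BY (ProjectId, JobId, Timestamp);

-- MV populates the table from otel_traces, filtered by ServiceName.
CREATE MATERIALIZED VIEW ci_job_telemetry_traces_mv
TO ci_job_telemetry_traces AS
SELECT
    Timestamp,
    TraceId,
    SpanId,
    SpanName,
    Duration,
    toUInt64OrZero(SpanAttributes['gitlab.project_id']) AS ProjectId,
    toUInt64OrZero(SpanAttributes['gitlab.job_id'])     AS JobId
FROM otel_traces
WHERE ServiceName = 'gitlab-runner';  -- assumed service name
```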
## Architecture Document
- **MR**: <https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17980>
- **Live preview**: <https://gitlab-com.gitlab.io/content-sites/handbook/mr17980/handbook/engineering/architecture/design-documents/ci_job_telemetry/>
## Parent Work Item
- <https://gitlab.com/groups/gitlab-org/quality/analytics/-/work_items/22>
## MVC Scope
1. **Runner telemetry collection**: Instrument GitLab Runner to collect timing and metadata for built-in build stages (git clone, cache, artifacts, scripts) and CI Functions
2. **OIDC/workload identity auth**: Runners on GitLab.com hosted infrastructure authenticate to the OTEL Collector using OIDC tokens or workload identity — no auth gateway needed for MVC
3. **OTEL Collector → ClickHouse**: Standard OTEL Collector with ClickHouse exporter writes OTLP spans to the `otel_traces` table (auto-created by the exporter). A [loadbalancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/loadbalancingexporter) routes spans by `traceID` to ensure trace completeness across backends.
4. **CI telemetry Materialized View**: `ci_job_telemetry_traces` MV with denormalized CI-specific columns, populated from `otel_traces` filtered by `ServiceName`
5. **Feature negotiation**: Controlled by a **project-level feature flag** for gradual rollout. The job payload communicates only enablement (yes/no) — no sampling configuration is passed per-job. MVC targets DevExp Customer0 projects on hosted runners, then ramps up progressively.
6. **Sampling**: Configured entirely at the OTEL Collector level using the [probabilistic sampler processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/probabilisticsamplerprocessor). Start with a conservative rate (~10%) to validate the pipeline end-to-end, then ramp up toward 100% for the Customer0 scope. Adjustable without any Rails or Runner changes.
7. **Internal dashboards**: Enable the DevExp team (Customer0) to build Grafana dashboards using [ClickHouse trace visualization](https://clickhouse.com/docs/observability/grafana#traces)
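Taken together, the Collector-side pieces above (OTLP ingest, probabilistic sampling, trace-aware load balancing) might look roughly like this in Collector config. Backend hostnames and the 10% rate are placeholders; in the two-tier load-balancing setup, the ClickHouse exporter runs on the backend tier, not shown here:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  probabilistic_sampler:
    sampling_percentage: 10   # conservative initial rate; ramp toward 100

exporters:
  loadbalancing:
    routing_key: traceID      # keep all spans of one trace on one backend
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:            # placeholder backend Collectors
          - backend-1:4317
          - backend-2:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler]
      exporters: [loadbalancing]
```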
## Sub-Epics by Team
### MVC (Phase 1)
| Team | Epic / Work Item | Description |
|------|------------------|-------------|
| ~"group::runner core" | &20633 | Runner basic instrumentation (~1w: feature negotiation + first `job_execution` span + built-in stage spans), then CI Functions spans (~2w) |
| ~"group::ci platform" | &20945 | Feature negotiation, trace context init, MV, Rails query layer |
| ~"group::Observability" | ~~OTEL Collector + ClickHouse instance ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102))~~ | ✅ ClickHouse instance enabled (MR merged 2026-02-12). OTEL Collector deployment in progress. |
#### CI Platform Issues (~"group::ci platform" — &20945)
| Issue | Scope |
|-------|-------|
| gitlab-org/gitlab#590588+ | MVC |
| gitlab-org/gitlab#590587+ | MVC |
| gitlab-org/gitlab#590586+ | MVC |
#### Runner Core Issues (~"group::runner core" — &20633)
| Issue | Milestone | Description |
|-------|-----------|-------------|
| gitlab-org/gitlab-runner#39231+ | Basic instrumentation | Feature negotiation, OTLP export client (via LabKit), first `job_execution` span |
| gitlab-org/gitlab-runner#39230+ | Basic instrumentation | Built-in build stage span instrumentation |
| gitlab-org/gitlab-runner#39271+ | CI Functions spans | CI Functions span instrumentation |
### Post-MVC
| Team | Issue / Work Item | Phase | Description |
|------|-------------------|-------|-------------|
| ~"group::runner core" / TBD | Token-based auth gateway | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | Auth gateway for self-managed runners that can't use OIDC |
| ~"group::ci platform" | gitlab-org/gitlab#589219+ | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | `POST /api/v4/internal/ci/telemetry/auth` — validates runner/job tokens |
| ~"group::ci platform" | gitlab-org/gitlab#590589+ | Phase 3 ([Data consumable by users workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-data-consumable-by-users)) | Rails GraphQL query layer for CI telemetry traces |
| ~"group::Observability" | Self-managed/Dedicated deployment | Phase 2 ([Beyond GitLab.com workstream](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#workstream-beyond-gitlabcom-hosted-runners)) | OTEL Collector + ClickHouse via [Tenant Observability Stack](https://gitlab-com.gitlab.io/gl-infra/terraform-modules/observability/tenant-observability-stack/) (`k8s-monitoring-stack` Helm chart) |
## Key Design Decisions
- **Service-agnostic backend**: `otel_traces` table is created by the standard OTEL Collector ClickHouse exporter. Any GitLab component can emit traces; CI telemetry is the first consumer via a Materialized View.
- **OIDC/workload identity auth (MVC)**: GitLab.com hosted runners authenticate directly using OIDC tokens. No auth gateway or proxy needed for MVC.
- **Token-based auth gateway (post-MVC)**: Self-managed runners that can't use OIDC will authenticate via a gateway that validates runner/job tokens against a Rails endpoint.
- **Project-level feature flag**: Rollout controlled by a project-level feature flag — enables gradual enablement starting with Customer0 projects.
- **OTEL Collector-side sampling**: Sampling rate is configured entirely at the OTEL Collector using the probabilistic sampler processor — not communicated per-job. Start with ~10% and ramp up. Adjustable without Rails or Runner changes.
- **Single endpoint with trace-aware routing**: All components (Rails, Workhorse, Runner) push to a single well-known OTEL Collector endpoint. A [loadbalancing exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/loadbalancingexporter) routes spans by `traceID`, ensuring all spans for a trace converge on the same backend.
- **Grafana visualization**: MVC uses ClickHouse as a Grafana datasource for [trace visualization](https://clickhouse.com/docs/observability/grafana#traces). The Observability team is setting up the same approach for their own services (see [mimir tracing](https://gitlab.com/gitlab-com/gl-infra/observability/team/-/work_items/4416)).
- **Standard OTEL pipeline**: No custom pipeline components — standard OTEL Collector with the ClickHouse exporter.
- **OTLP format**: Industry-standard OpenTelemetry Protocol with standard fields.
- **ClickStack compatibility**: The `otel_traces` schema is [functionally compatible with ClickStack](https://clickhouse.com/docs/use-cases/observability/clickstack/ingesting-data/schemas#traces), making future migration to the full ClickStack platform straightforward.
- **Tenant Observability Stack for SM**: Self-managed/Dedicated deployment (Phase 2, Beyond GitLab.com workstream) will leverage the Observability team's [Tenant Observability Stack](https://gitlab-com.gitlab.io/gl-infra/terraform-modules/observability/tenant-observability-stack/) rather than shipping observability as part of GitLab itself. This avoids licensing concerns (for example, Grafana).
- **Observability team collaboration**: The Observability team (~"group::Observability") provides the OTEL Collector and ClickHouse instance. Architecture designed to converge with their distributed tracing infrastructure.
- **Graceful degradation**: Telemetry failures never affect job outcome.
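The routing invariant behind the "single endpoint with trace-aware routing" decision can be illustrated with a toy sketch (backend names are hypothetical; the real loadbalancing exporter uses a consistent-hash ring, not a simple modulo):

```python
import hashlib

def route_backend(trace_id: str, backends: list[str]) -> str:
    """Map a trace ID to one backend deterministically.

    Toy illustration of trace-aware routing: every span carrying the
    same trace ID lands on the same backend, so traces stay complete.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]

backends = ["backend-1:4317", "backend-2:4317", "backend-3:4317"]
# All spans of a job's trace share its trace ID, so they route identically:
spans = [("4bf92f3577b34da6a3ce929d0e0e4736", name)
         for name in ("job_execution", "git_clone", "upload_artifacts")]
targets = {route_backend(trace_id, backends) for trace_id, _ in spans}
assert len(targets) == 1  # one backend per trace
```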
## Staffing & Ownership
| Area | Owner | Team |
|------|-------|------|
| Feature negotiation, trace context, MV, query layer | @pedropombeiro | ~"group::ci platform" |
| Runner OTLP span emission, LabKit OTEL integration | @ash2k | ~"group::runner core" |
| OTEL Collector + ClickHouse pipeline | @nduff / @e_forbes | ~"group::Observability" |
## Rollout Strategy
Tracked in gitlab-org/gitlab#590939+ (`ci_job_telemetry` feature flag rollout).
1. **Customer0 scope**: Enable for DevExp Customer0 projects on GitLab.com hosted runners. Validate end-to-end pipeline (Runner → OTEL Collector → ClickHouse → MV).
2. **Expand to broader GitLab.com projects**: Gradually increase FF percentage, monitoring storage growth and Collector resource usage.
3. **GA / default-on**: Remove FF once pipeline is proven at scale.
Sampling follows a [phased approach](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/ci_job_telemetry/#sampling-strategy) — head sampling (MVC) → collector-based probabilistic → tail sampling for failed/slow jobs.
## Operational Ownership & On-call
| Component | Operational owner | Notes |
|-----------|-------------------|-------|
| Feature flag, Rails endpoints, MV schema | ~"group::ci platform" | Standard CI Platform on-call rotation |
| OTEL Collector, ClickHouse instance | ~"group::Observability" | Managed by Observability team's tenant stack |
| Runner OTEL SDK integration | ~"group::runner core" | Runner team owns runner-side instrumentation |
The OTEL Collector and ClickHouse instance are part of the Observability team's managed infrastructure — CI Platform does not take on-call responsibility for the telemetry pipeline itself, only for the Rails-side feature negotiation and the ClickHouse Materialized View schema.
## Monitoring & Success Metrics
### Monitoring
- **OTEL Collector health**: Collector metrics (spans received/exported, queue depth, error rates) — owned by Observability team
- **ClickHouse MV ingestion**: Row counts and lag in `ci_job_telemetry_traces` vs `otel_traces`
- **Feature flag adoption**: Number of projects/jobs with `ci_job_telemetry` enabled
### Success metrics (MVC)
| Metric | Target |
|--------|--------|
| End-to-end trace visibility | Traces from Customer0 CI jobs visible in Grafana within 5 minutes of job completion |
| Ingestion reliability | <0.1% span drop rate at the OTEL Collector |
| Query performance | p95 query latency <5s for single-job trace lookup |
| Zero user-facing impact | No measurable increase in `/api/v4/jobs/request` latency or Runner job pickup time |
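For context on the single-job lookup target, the p95 latency metric would apply to a query of roughly this shape (column names are assumptions about the `ci_job_telemetry_traces` schema, not the final design):

```sql
-- Illustrative single-job trace lookup; JobId value is a placeholder.
SELECT SpanName,
       Duration / 1e6 AS duration_ms,   -- assuming Duration in nanoseconds
       Timestamp
FROM ci_job_telemetry_traces
WHERE JobId = 1234567890
ORDER BY Timestamp;
```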
## Open Questions
### ClickHouse instance access from Rails
For the query layer (gitlab-org/gitlab#590589) and the future Pipeline Optimization Agent, Rails needs to read trace data. Traces are ingested into the [Observability team's ClickHouse instance](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/merge_requests/102), which is separate from the main production ClickHouse instance that Rails queries via `ClickHouse::Client`.
Options under discussion:
1. **Rails connects to the Observability CH instance** — single source of truth, but new infra dependency
2. **Replicate data to the main CH instance** — fits existing Rails patterns, but adds duplication/lag
3. **API/service layer** — clean separation, but extra hop and maintenance
Pending [discussion with Observability team](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/17980#note_3112470636).
## Timeline
- **ClickHouse instance**: ✅ Observability team's instance ([MR !102](https://gitlab.com/gitlab-com/gl-infra/observability/clickhouse-cloud/-/merge_requests/102) — merged 2026-02-12)
- **CI Platform (MVC)**: ~2-2.5 weeks — issues under &20945 (gitlab-org/gitlab#590588+ and gitlab-org/gitlab#590587+ must land before Runner basic instrumentation)
- **Runner Core (basic instrumentation)**: ~1 week (per @ash2k) — issues under &20633