
Add CI/CD job metrics to the CI/CD Analytics View for projects

Overview

GitLab's CI/CD Analytics now combines CI/CD pipeline and CI/CD job performance trends, enabling developers to quickly identify inefficient or problematic CI/CD jobs. By surfacing these capabilities directly in the GitLab UI, developers get the in-context tools they need to pinpoint and fix CI/CD performance problems that, left unaddressed, can significantly reduce a development team's velocity and overall productivity. For platform administrators, adding CI/CD job data to this view also reduces the need to rely on external or custom-built CI/CD observability solutions when operating GitLab at enterprise scale.

Feature summary

Display CI/CD job metrics for each job in the pipeline on the CI/CD analytics page. The default time window for the metrics is the last 30 days.

The metrics to display for each job are:

  • Job name
  • Stage
  • P50 duration
  • P95 duration
  • Failure rate
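
For illustration, the three numeric metrics above can be computed from a window of finished job runs as follows. This is a minimal Python sketch with hypothetical field names (`duration`, `status`), not GitLab code; the actual implementation aggregates in ClickHouse as described in the implementation plan below.

```python
# Sketch: derive P50 duration, P95 duration, and failure rate from a
# list of finished job runs. Field names are hypothetical.
def percentile(sorted_values, p):
    """Nearest-rank percentile on a pre-sorted list."""
    if not sorted_values:
        return None
    idx = round(p * (len(sorted_values) - 1))
    return sorted_values[max(0, min(len(sorted_values) - 1, idx))]

def job_metrics(runs):
    """runs: list of dicts with 'duration' (seconds) and 'status'."""
    durations = sorted(r["duration"] for r in runs)
    failed = sum(1 for r in runs if r["status"] == "failed")
    return {
        "p50_duration": percentile(durations, 0.50),
        "p95_duration": percentile(durations, 0.95),
        "failure_rate": failed / len(runs) if runs else 0.0,
    }

runs = [{"duration": d, "status": "success"} for d in (10, 20, 30, 40)]
runs.append({"duration": 300, "status": "failed"})
print(job_metrics(runs))  # one failure out of five runs -> 0.2 failure rate
```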

GitLab Tier

  • Premium
  • Ultimate

Problem

Problem validation summary

Admins and platform engineers want to see job, pipeline, and runner metrics in the same view. When thinking about how to optimize pipelines, runners are a major factor, but so are the pipeline configuration and even the way the repository is set up. Users need a single place where they can make effective decisions about how to optimize CI/CD, which specifically means including more extensive metrics for pipelines and jobs.

Example customer problem to solve:

Current state

  • Long pipeline execution time (3.5 to 4 hours) for the whole pipeline
    • This is run as a nightly build
    • Already includes optimizations on matrix builds
    • For feature branches: a “short” pipeline with about 1h runtime, but it doesn’t test everything

Challenges:

  • They need to run their pipelines on machines with real-time kernels with custom patches (on-premise hardware stack).
  • They want to optimize pipeline execution and need data to do so more efficiently, including:
    • Which parts of the pipeline take how much time?
    • How often do jobs fail, and why? (they have flaky pipeline jobs)
    • How long do certain infrastructure-related steps take?

Workaround solution

  • Built observability-like features themselves
  • Would like to see more/better data/visibility inside GitLab
  • They found an issue on a specific runner machine using that self-built tool


Proposal

  • Add a new panel to the dashboard with a table of job metrics with the following columns:
    • job name
    • stage it belongs to
    • mean duration
    • p95 duration
    • failure rate
  • Pagination should be shown in the panel after 10 items
  • Each column should be sortable
  • By default, the table should be sorted by mean duration
  • The user should be able to use the search bar in the panel to search for a job name
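
Taken together, the proposed table behavior (default sort by mean duration, sortable columns, search by job name, pagination after 10 items) can be sketched as follows. This is a hypothetical Python illustration of the panel's logic, not the actual frontend implementation; row field names are assumptions.

```python
# Sketch of the job-metrics panel: substring search on job name,
# sortable columns (default: mean duration, descending), and
# pagination after 10 rows.
PAGE_SIZE = 10

def panel_rows(rows, query="", sort_key="mean_duration",
               descending=True, page=1):
    """rows: list of dicts with 'name', 'stage', 'mean_duration',
    'p95_duration', and 'failure_rate' keys (hypothetical schema)."""
    visible = [r for r in rows if query.lower() in r["name"].lower()]
    visible.sort(key=lambda r: r[sort_key], reverse=descending)
    start = (page - 1) * PAGE_SIZE
    return visible[start:start + PAGE_SIZE]

jobs = [{"name": f"job-{i}", "stage": "test", "mean_duration": i,
         "p95_duration": i * 2, "failure_rate": 0.01} for i in range(25)]
first_page = panel_rows(jobs)
print([r["name"] for r in first_page])  # slowest 10 jobs first
```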

🎨 Design in design management

🖌️ Figma file

Out of scope and needs to be explored in follow-up issues

The following came from the feedback phase of this issue, and requires validation before implementing 👇

  • Job duration trends (for example, the duration has been trending up in the past week)
  • Specific metrics on test jobs
  • Filtering the job panel by other data like stage or job tag(s)
  • Viewing metrics related to a single job in the table (see exploration here)
  • Viewing a breakdown of failures that make up the failure rate of a pipeline or job
  • Actions to take based on the data (for example, an action to update a job's configuration in the .gitlab-ci.yml file to improve duration)
  • Visually indicating anomalies in the metrics (for example, a job's failure rate is 7% when it is usually 1%)
  • Visually indicating metrics that fall above a certain threshold set (anything above 5% failure or 1 min duration) as well as allowing users to add custom thresholds to be notified for
  • Predicting metrics based on historical data
  • Using AI to automatically optimize job or pipeline performance to result in improved speed, status, or cost

Implementation plan

We'll need to build a materialized view that aggregates the required job statistics per day, keyed by job name (along with project, stage, pipeline source, and ref):

-- Create the MV with the new name
CREATE MATERIALIZED VIEW gitlab_clickhouse_development.ci_job_performance_daily_mv
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (project_id, source, ref, name, stage_id, date)
AS
SELECT
    toDate(b.finished_at) AS date,
    b.project_id,
    b.stage_id,
    b.name,
    p.source,
    p.ref,
    quantilesState(0.5, 0.95)(b.duration) AS duration_quantiles,
    countState() AS total_builds,
    countStateIf(b.status = 'failed') AS failed_builds
FROM gitlab_clickhouse_development.ci_finished_builds b
INNER JOIN gitlab_clickhouse_development.ci_finished_pipelines p 
    ON b.pipeline_id = p.id
WHERE b.finished_at > 0  -- Ensure we have valid finished times
GROUP BY date, b.project_id, b.stage_id, b.name, p.source, p.ref;

We can then backfill the last 180 days with:

INSERT INTO gitlab_clickhouse_development.ci_job_performance_daily_mv
SELECT
    toDate(b.finished_at) AS date,
    b.project_id,
    b.stage_id,
    b.name,
    p.source,
    p.ref,
    quantilesState(0.5, 0.95)(b.duration) AS duration_quantiles,
    countState() AS total_builds,
    countStateIf(b.status = 'failed') AS failed_builds
FROM gitlab_clickhouse_development.ci_finished_builds b
INNER JOIN gitlab_clickhouse_development.ci_finished_pipelines p 
    ON b.pipeline_id = p.id
WHERE b.finished_at > 0
    AND b.finished_at >= today() - INTERVAL 180 DAY
GROUP BY date, b.project_id, b.stage_id, b.name, p.source, p.ref;

The query to populate the panel's table would look like the following:

-- Last 30 days, filtered by source and ref
SELECT 
    project_id,
    stage_id,
    name,
    countMerge(total_builds) AS total_builds,
    countMerge(failed_builds) AS failed_builds,
    quantilesMerge(0.5, 0.95)(duration_quantiles) AS duration_percentiles,
    duration_percentiles[1] AS p50_duration,
    duration_percentiles[2] AS p95_duration,
    if(total_builds > 0, failed_builds / total_builds, 0) AS failure_rate
FROM gitlab_clickhouse_development.ci_job_performance_daily_mv
WHERE date >= today() - INTERVAL 30 DAY
    AND project_id = ?
    AND source = ?  -- Filter by source
    AND ref = ?     -- Filter by ref
GROUP BY project_id, stage_id, name
ORDER BY p50_duration DESC;
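
The reason the MV stores `quantilesState(...)` and the read query uses `quantilesMerge(...)`, rather than storing finished per-day percentiles, is that percentiles are not additive: averaging daily p95 values can give a very different answer from the true p95 over the whole window, while merging aggregation states yields the correct window-wide estimate. A small illustration with made-up durations and a nearest-rank percentile:

```python
# Percentiles don't compose across days: averaging per-day p95 values
# is not the same as the p95 over the merged window.
def p95(values):
    s = sorted(values)
    idx = round(0.95 * (len(s) - 1))
    return s[max(0, min(len(s) - 1, idx))]

day1 = [10] * 10 + [100] * 10   # half the runs hit a slow path
day2 = [10] * 20                # all runs fast

naive_p95 = (p95(day1) + p95(day2)) / 2  # average of daily p95s
true_p95 = p95(day1 + day2)              # p95 over the whole window
print(naive_p95, true_p95)               # the two disagree
```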

This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.

