Add projections to `ci_finished_builds`, `ci_finished_pipelines` ch table (!202891) · Merge requests · GitLab.org / GitLab

What does this MR do and why?

Add projection build_stats_by_project_pipeline_name_stage to ci_finished_builds ch table
Add projection by_path_source_ref_finished_at to ci_finished_pipelines ch table

Why this change?

Draft: GraphQL API for Project job Analytics (!201267 - closed) - Query in this MR takes a minimum of 3-5 seconds in production
This is because the table has no necessary indexes, projections or materialized views.
These projections are balanced to support all the queries we support and are going to support in the above MR.

References

Add jobAnalytics GraphQL API to fetch data on o... (#541703 - closed)

How to set up and validate locally

Run this to execute the migration

bundle exec rake gitlab:clickhouse:migrate

Clickhouse add this projection async. So, this migration runs faster
To validate if the job is running, run this in clickhouse console

SELECT *
FROM system.mutations
WHERE database = 'gitlab_clickhouse_development'
  AND table IN ('ci_finished_builds', 'ci_finished_pipelines')
  AND NOT is_done;

Depending on the data size, it may take a while. (Took more than 1 hour with 1B records in local)
Once the above job is done in parts. Run these queries to check the performance

-- 1. WITH projection (default - should be fast)
SELECT /* with_projection */
    b.stage_id,
    b.name,
    avg(b.duration) / 1000.0                      AS mean_duration_in_seconds,
    quantile(0.95)(b.duration) / 1000.0           AS p95_duration_in_seconds,
    countIf(b.status = 'success') / count() * 100 AS rate_of_success
FROM ci_finished_builds b
WHERE project_id = 278964
GROUP BY b.stage_id, b.name
ORDER BY rate_of_success DESC
LIMIT 10
    SETTINGS optimize_use_projections = 1;
-- Explicitly enable

-- 2. WITHOUT projection (should be slow)
SELECT /* without_projection */
    b.stage_id,
    b.name,
    avg(b.duration) / 1000.0                      AS mean_duration_in_seconds,
    quantile(0.95)(b.duration) / 1000.0           AS p95_duration_in_seconds,
    countIf(b.status = 'success') / count() * 100 AS rate_of_success
FROM ci_finished_builds b
WHERE project_id = 278964
GROUP BY b.stage_id, b.name
ORDER BY rate_of_success DESC
LIMIT 10
    SETTINGS optimize_use_projections = 0;
-- Disable projections

-- 3. Check the performance metrics
SELECT query,
       event_time_microseconds,
       query_start_time_microseconds,
       query_duration_ms,
       read_rows,
       read_bytes,
       memory_usage,
       partitions,
       peak_threads_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND (query LIKE '%with_projection%' OR query LIKE '%without_projection%')
  AND query LIKE '%278964%'
  AND event_time > now() - INTERVAL 2 MINUTE
ORDER BY event_time DESC
LIMIT 2;

Observed results with 1B records.

Metric	Without Projection	With Projection	Improvement
Query Duration	1455 ms	485 ms	3x faster
Rows Read	260,736,511 (260M)	35,322 (35K)	7,380x less
Bytes Read	11.6 GB	7.75 MB	1,500x less
Memory Usage	252 MB	945 MB	Higher

Query Plan

EXPLAIN
SELECT b.stage_id,
       b.name,
       avg(b.duration) / 1000.0                      AS mean_duration_in_seconds,
       quantile(0.95)(b.duration) / 1000.0           AS p95_duration_in_seconds,
       countIf(b.status = 'success') / count() * 100 AS rate_of_success
FROM ci_finished_builds b
WHERE project_id = 278964
GROUP BY b.stage_id, b.name
ORDER BY rate_of_success DESC
LIMIT 10

-- Result

    Expression
((Project names + (Before ORDER BY + Projection) [lifted up part]))
    Limit (preliminary LIMIT (without OFFSET))
Sorting
    (Sorting for ORDER BY)
    Expression
    ((Before ORDER BY + Projection))
    Aggregating
          Filter
            ReadFromMergeTree (proj_by_name_stage)

Uncovered and needs to be discussed (and will be taken in next MRs):

Search by job name

Can we create an index like this?

    ALTER TABLE ci_finished_builds
    ADD INDEX name_lowercase(lower(name)) TYPE text;

    ALTER TABLE ci_finished_builds MATERIALIZE INDEX name_lowercase;

Filter by pipeline attrs
- We don't have project_id in ci_finished_pipelines (so it creates search space huge)
- But, we have project_path, need to check if that helps
- Can't use hourly, daily MVs as those are aggregations and won't have pipeline id
- Once this MR is shipped, we can choose any one of the approaches below:
  - Use join
  - Use IN
    - Sample Query:

SELECT b.stage_id,
       b.name,
       avg(b.duration) / 1000.0                      AS mean_duration_in_seconds,
       quantile(0.95)(b.duration) / 1000.0           AS p95_duration_in_seconds,
       countIf(b.status = 'success') / count() * 100 AS rate_of_success
FROM ci_finished_builds b
WHERE project_id = 278964
  AND pipeline_id in (SELECT id from ci_finished_pipelines where source = 'api' and ref = 'main')
GROUP BY b.stage_id, b.name
ORDER BY rate_of_success DESC
LIMIT 10
    SETTINGS optimize_use_projections = 1

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited Sep 03, 2025 by Narendran

Add projections to ci_finished_builds, ci_finished_pipelines ch table

What does this MR do and why?

References

How to set up and validate locally

Uncovered and needs to be discussed (and will be taken in next MRs):

MR acceptance checklist

Merge request reports

Add projections to `ci_finished_builds`, `ci_finished_pipelines` ch table