Add projections to ci_finished_builds, ci_finished_pipelines ch table

What does this MR do and why?

  • Add projection build_stats_by_project_pipeline_name_stage to ci_finished_builds ch table
  • Add projection by_path_source_ref_finished_at to ci_finished_pipelines ch table

Why this change?

  • Draft: GraphQL API for Project job Analytics (!201267 - closed) - Query in this MR takes a minimum of 3-5 seconds in production
  • This is because the table has no necessary indexes, projections or materialized views.
  • These projections are balanced to support all the queries we support and are going to support in the above MR.

References

Add jobAnalytics GraphQL API to fetch data on o... (#541703 - closed)

How to set up and validate locally

  • Run this to execute the migration
bundle exec rake gitlab:clickhouse:migrate
  • Clickhouse add this projection async. So, this migration runs faster
  • To validate if the job is running, run this in clickhouse console
SELECT *
FROM system.mutations
WHERE database = 'gitlab_clickhouse_development'
  AND table IN ('ci_finished_builds', 'ci_finished_pipelines')
  AND NOT is_done;
  • Depending on the data size, it may take a while. (Took more than 1 hour with 1B records in local)
  • Once the above job is done in parts. Run these queries to check the performance
-- 1. WITH projection (default - should be fast)
SELECT /* with_projection */
    b.stage_id,
    b.name,
    avg(b.duration) / 1000.0                      AS mean_duration_in_seconds,
    quantile(0.95)(b.duration) / 1000.0           AS p95_duration_in_seconds,
    countIf(b.status = 'success') / count() * 100 AS rate_of_success
FROM ci_finished_builds b
WHERE project_id = 278964
GROUP BY b.stage_id, b.name
ORDER BY rate_of_success DESC
LIMIT 10
    SETTINGS optimize_use_projections = 1;
-- Explicitly enable

-- 2. WITHOUT projection (should be slow)
SELECT /* without_projection */
    b.stage_id,
    b.name,
    avg(b.duration) / 1000.0                      AS mean_duration_in_seconds,
    quantile(0.95)(b.duration) / 1000.0           AS p95_duration_in_seconds,
    countIf(b.status = 'success') / count() * 100 AS rate_of_success
FROM ci_finished_builds b
WHERE project_id = 278964
GROUP BY b.stage_id, b.name
ORDER BY rate_of_success DESC
LIMIT 10
    SETTINGS optimize_use_projections = 0;
-- Disable projections

-- 3. Check the performance metrics
SELECT query,
       event_time_microseconds,
       query_start_time_microseconds,
       query_duration_ms,
       read_rows,
       read_bytes,
       memory_usage,
       partitions,
       peak_threads_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND (query LIKE '%with_projection%' OR query LIKE '%without_projection%')
  AND query LIKE '%278964%'
  AND event_time > now() - INTERVAL 2 MINUTE
ORDER BY event_time DESC
LIMIT 2;
  • Observed results with 1B records.
Metric Without Projection With Projection Improvement
Query Duration 1455 ms 485 ms 3x faster
Rows Read 260,736,511 (260M) 35,322 (35K) 7,380x less
Bytes Read 11.6 GB 7.75 MB 1,500x less
Memory Usage 252 MB 945 MB Higher
  • Query Plan
EXPLAIN
SELECT b.stage_id,
       b.name,
       avg(b.duration) / 1000.0                      AS mean_duration_in_seconds,
       quantile(0.95)(b.duration) / 1000.0           AS p95_duration_in_seconds,
       countIf(b.status = 'success') / count() * 100 AS rate_of_success
FROM ci_finished_builds b
WHERE project_id = 278964
GROUP BY b.stage_id, b.name
ORDER BY rate_of_success DESC
LIMIT 10

-- Result

    Expression
((Project names + (Before ORDER BY + Projection) [lifted up part]))
    Limit (preliminary LIMIT (without OFFSET))
Sorting
    (Sorting for ORDER BY)
    Expression
    ((Before ORDER BY + Projection))
    Aggregating
          Filter
            ReadFromMergeTree (proj_by_name_stage)

Uncovered and needs to be discussed (and will be taken in next MRs):

  • Search by job name
    • Can we create an index like this?
          ALTER TABLE ci_finished_builds
          ADD INDEX name_lowercase(lower(name)) TYPE text;
      
          ALTER TABLE ci_finished_builds MATERIALIZE INDEX name_lowercase;
  • Filter by pipeline attrs
    • We don't have project_id in ci_finished_pipelines (so it creates search space huge)
    • But, we have project_path, need to check if that helps
    • Can't use hourly, daily MVs as those are aggregations and won't have pipeline id
    • Once this MR is shipped, we can choose any one of the approaches below:
      • Use join
      • Use IN
        • Sample Query:
SELECT b.stage_id,
       b.name,
       avg(b.duration) / 1000.0                      AS mean_duration_in_seconds,
       quantile(0.95)(b.duration) / 1000.0           AS p95_duration_in_seconds,
       countIf(b.status = 'success') / count() * 100 AS rate_of_success
FROM ci_finished_builds b
WHERE project_id = 278964
  AND pipeline_id in (SELECT id from ci_finished_pipelines where source = 'api' and ref = 'main')
GROUP BY b.stage_id, b.name
ORDER BY rate_of_success DESC
LIMIT 10
    SETTINGS optimize_use_projections = 1

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Narendran

Merge request reports

Loading