Skip to content

Add service calculating ci job queuing time percentiles using ClickHouse

Vladimir Shushlin requested to merge vshushlin/ci-queueing-delay-service into master

What does this MR do and why?

Extracted from Draft: Resolve "[Runner Fleet dashboard] API fo... (!128752 - closed) this MR adds:

  1. draft database schema for ci_builds
  2. a service that selects the data from CH

It will be used in GraphQL API for:

Database queries and performance

Click to expand

NOTE: to get a decent performance locally, you need to add PARTITION BY toStartOfDay(started_at_bucket) to the materialized view:

diff --git a/db/click_house/main/20230719101806_create_ci_finished_builds_aggregated_queueing_delay_percentiles.sql b/db/click_house/main/20230719101806_create_ci_finished_builds_aggregated_queueing_delay_percentiles.sql
index 0fb59783192f..0a2aad8e5243 100644
--- a/db/click_house/main/20230719101806_create_ci_finished_builds_aggregated_queueing_delay_percentiles.sql
+++ b/db/click_house/main/20230719101806_create_ci_finished_builds_aggregated_queueing_delay_percentiles.sql
@@ -9,3 +9,6 @@ CREATE TABLE ci_finished_builds_aggregated_queueing_delay_percentiles
 )
 ENGINE = AggregatingMergeTree()
 ORDER BY (status, runner_type, started_at_bucket)
+PARTITION BY toStartOfDay(started_at_bucket)

Otherwise, we'll only have a single partition, and ClickHouse will only use a single thread. See more discussion on the slack thread

vladimirs-macbook-pro.local :) SELECT count(*) FROM ci_finished_builds;

SELECT count(*)
FROM ci_finished_builds

Query id: 3098f35e-9497-46e6-8a27-290e196facc5

┌───count()─┐
 300000000 
└───────────┘

1 row in set. Elapsed: 0.001 sec.

vladimirs-macbook-pro.local :) EXPLAIN PIPELINE SELECT  started_at_bucket as time,
        INTERVAL quantileMerge(0.99)(start_delay_quantile) SECOND as p99
,
INTERVAL quantileMerge(0.95)(start_delay_quantile) SECOND as p95
,
INTERVAL quantileMerge(0.90)(start_delay_quantile) SECOND as p90
,
INTERVAL quantileMerge(0.50)(start_delay_quantile) SECOND as p50
,
INTERVAL quantileMerge(0.25)(start_delay_quantile) SECOND as p25

FROM    ci_finished_builds_aggregated_queueing_delay_percentiles
WHERE   status IN ['success', 'failure'] AND
    started_at_bucket >= '2023-09-06 09:19:24' AND
    started_at_bucket <= '2023-09-06 12:19:24'

GROUP BY started_at_bucket
ORDER BY started_at_bucket;

EXPLAIN PIPELINE
SELECT
    started_at_bucket AS time,
    toIntervalSecond(quantileMerge(0.99)(start_delay_quantile)) AS p99,
    toIntervalSecond(quantileMerge(0.95)(start_delay_quantile)) AS p95,
    toIntervalSecond(quantileMerge(0.9)(start_delay_quantile)) AS p90,
    toIntervalSecond(quantileMerge(0.5)(start_delay_quantile)) AS p50,
    toIntervalSecond(quantileMerge(0.25)(start_delay_quantile)) AS p25
FROM ci_finished_builds_aggregated_queueing_delay_percentiles
WHERE (status IN ['success', 'failure']) AND (started_at_bucket >= '2023-09-06 09:19:24') AND (started_at_bucket <= '2023-09-06 12:19:24')
GROUP BY started_at_bucket
ORDER BY started_at_bucket ASC

Query id: 6b666bc0-21d3-43ec-af30-0cd6c25a60c5

┌─explain───────────────────────────────────────┐
 (Expression)                                  
 ExpressionTransform                           
   (Sorting)                                   
   MergingSortedTransform 10  1               
     MergeSortingTransform × 10                
       LimitsCheckingTransform × 10            
         PartialSortingTransform × 10          
           (Expression)                        
           ExpressionTransform × 10            
             (Aggregating)                     
             Resize 5  10                     
               AggregatingTransform × 5        
                 StrictResize 5  5            
                   (Expression)                
                   ExpressionTransform × 5     
                     (ReadFromMergeTree)       
                     MergeTreeThread × 5 0  1 
└───────────────────────────────────────────────┘

17 rows in set. Elapsed: 0.004 sec.

vladimirs-macbook-pro.local :) SELECT  started_at_bucket as time,
        INTERVAL quantileMerge(0.99)(start_delay_quantile) SECOND as p99
,
INTERVAL quantileMerge(0.95)(start_delay_quantile) SECOND as p95
,
INTERVAL quantileMerge(0.90)(start_delay_quantile) SECOND as p90
,
INTERVAL quantileMerge(0.50)(start_delay_quantile) SECOND as p50
,
INTERVAL quantileMerge(0.25)(start_delay_quantile) SECOND as p25

FROM    ci_finished_builds_aggregated_queueing_delay_percentiles
WHERE   status IN ['success', 'failure'] AND
    started_at_bucket >= '2023-09-06 09:19:24' AND
    started_at_bucket <= '2023-09-06 12:19:24'

GROUP BY started_at_bucket
ORDER BY started_at_bucket;

SELECT
    started_at_bucket AS time,
    toIntervalSecond(quantileMerge(0.99)(start_delay_quantile)) AS p99,
    toIntervalSecond(quantileMerge(0.95)(start_delay_quantile)) AS p95,
    toIntervalSecond(quantileMerge(0.9)(start_delay_quantile)) AS p90,
    toIntervalSecond(quantileMerge(0.5)(start_delay_quantile)) AS p50,
    toIntervalSecond(quantileMerge(0.25)(start_delay_quantile)) AS p25
FROM ci_finished_builds_aggregated_queueing_delay_percentiles
WHERE (status IN ['success', 'failure']) AND (started_at_bucket >= '2023-09-06 09:19:24') AND (started_at_bucket <= '2023-09-06 12:19:24')
GROUP BY started_at_bucket
ORDER BY started_at_bucket ASC

Query id: b81c4fd0-6de5-4abf-af7b-b3430d492137

┌───────────────────────time─┬─p99─┬─p95─┬─p90─┬─p50─┬─p25─┐
 2023-09-06 09:20:00.000000   47   30   23    6    2 
 2023-09-06 09:25:00.000000   44   29   22    6    2 
 2023-09-06 09:30:00.000000   45   30   22    7    2 
 2023-09-06 09:35:00.000000   44   30   23    6    2 
 2023-09-06 09:40:00.000000   47   30   23    6    2 
 2023-09-06 09:45:00.000000   47   29   22    6    2 
 2023-09-06 09:50:00.000000   47   30   23    6    2 
 2023-09-06 09:55:00.000000   46   30   23    6    2 
 2023-09-06 10:00:00.000000   47   30   23    6    2 
 2023-09-06 10:05:00.000000   44   29   22    7    2 
 2023-09-06 10:10:00.000000   44   28   22    6    2 
 2023-09-06 10:15:00.000000   43   29   23    6    2 
 2023-09-06 10:20:00.000000   41   29   22    6    2 
 2023-09-06 10:25:00.000000   40   28   22    6    2 
 2023-09-06 10:30:00.000000   36   26   21    6    2 
 2023-09-06 10:35:00.000000   34   26   21    6    2 
 2023-09-06 10:40:00.000000   30   24   20    6    2 
 2023-09-06 10:45:00.000000   26   22   18    6    2 
 2023-09-06 10:50:00.000000   22   19   16    5    2 
 2023-09-06 10:55:00.000000   18   15   13    5    2 
 2023-09-06 11:00:00.000000   14   12   10    4    2 
 2023-09-06 11:05:00.000000    9    8    7    3    1 
 2023-09-06 11:10:00.000000    5    4    3    1    0 
 2023-09-06 11:15:00.000000    0    0    0    0    0 
└────────────────────────────┴─────┴─────┴─────┴─────┴─────┘

24 rows in set. Elapsed: 0.073 sec. Processed 6.11 thousand rows, 837.48 KB (84.23 thousand rows/s., 11.54 MB/s.)

How to set up and validate locally

  1. Follow steps from https://docs.gitlab.com/ee/development/database/clickhouse/clickhouse_within_gitlab.html up until ClickHouseTestRunner.new.ensure_schema
  2. Connect to clickhouse and generate the data (replace 300000000 with number of builds you want to generate):
    INSERT INTO ci_finished_builds
        (id, project_id, pipeline_id, status, finished_at, created_at, started_at, queued_at,
         runner_id, runner_manager_system_xid, runner_run_untagged, runner_type,
         runner_manager_version, runner_manager_revision, runner_manager_platform, runner_manager_architecture)
    SELECT
        number AS id,
        toUInt64(1 * pow(1 - (rand() / 4294967295.0), -1/1.0)) AS project_id,
        toUInt64(1 * pow(1 - (rand() / 4294967295.0), -1/1.0)) AS pipeline_id,
        arrayElement(['success', 'success', 'success', 'success', 'success', 'failed', 'failed', 'cancelled'], 1 + (rand() % 8)) AS status,
        (now() - toIntervalSecond(rand() % (86400 * 30))) AS finished_at,
        (finished_at - toIntervalSecond(-600 * ln(1 - rand() / 4294967295.0))) AS started_at,
        (started_at - toIntervalSecond(-10 * ln(1 - rand() / 4294967295.0))) AS queued_at,
        (queued_at - toIntervalSecond(-10 * ln(1 - rand() / 4294967295.0))) AS created_at,
        rand() % 10000 AS runner_id,
        toString(rand() % 1000) AS runner_manager_system_xid,
        rand() % 2 AS runner_run_untagged,
        rand() % 3 + 1 AS runner_type,
        '' AS runner_manager_version,
        '' AS runner_manager_revision,
        '' AS runner_manager_platform,
        '' AS runner_manager_architecture
    FROM numbers(300000000);
  3. execute the following in console:
    Feature.enable(:clickhouse_ci_analytics)
    Ci::CollectQueueingDelayHistoryService.new(current_user: User.first, percentiles: [99, 95, 90, 50, 25], runner_type: 1, from_time: 3.hours.ago, to_time: Time.now).execute

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Vladimir Shushlin

Merge request reports