
POC for ci analytics on ClickHouse

Vladimir Shushlin requested to merge vshushlin/runner-ch-poc into master

What does this MR do and why?

Here's an explanatory video: https://www.loom.com/share/3cf98980e8334f14a4740836989f4ff6?sid=d80c1c6d-c636-4236-aad9-b1d75785b910

This MR is a proof of concept of using ClickHouse to run analytics queries on the GitLab CI database.

It focuses on implementing "Wait time to pick up a job" from the runner dashboard (#390921): [new.png]

Such queries are slow in Postgres, even with indexes and a specialized table that contains only a small portion of builds: !110137 (comment 1298955647)

This MR adds:

  1. a few tables to the ClickHouse schema
  2. a service to sync finished builds from Postgres to ClickHouse
  3. a few queries with comparable performance (see below)

Data ingestion strategy

The rough idea is:

  1. only sync finished builds (ClickHouse isn't optimized for updating existing data)
  2. run a dedicated background worker every X amount of time: it collects new records and pushes them to ClickHouse in batches (one iteration is sketched below)
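
For illustration, one sync iteration could boil down to roughly these steps (a hedged sketch only: the batch size, parameter names, and exact Postgres query are my assumptions, and the real worker would run as a background job rather than raw SQL):

-- 1. (ClickHouse) Where did the previous run stop?
SELECT max((finished_at, id)) FROM ci_finished_builds_max_finished_at;

-- 2. (Postgres) Collect the next batch of builds finished after that point,
--    using keyset pagination on the (finished_at, id) tuple:
SELECT id, status, runner_type, queued_at, started_at, finished_at
FROM ci_builds
WHERE finished_at IS NOT NULL
  AND (finished_at, id) > (:max_finished_at, :max_id)
ORDER BY finished_at, id
LIMIT 1000;

-- 3. (ClickHouse) Push the whole batch in a single INSERT; ClickHouse handles
--    large batched inserts well, unlike row-by-row updates.
INSERT INTO ci_finished_builds
    (id, status, runner_type, queued_at, started_at, finished_at)
VALUES (1, 'success', 0, '2023-06-18 12:00:00', '2023-06-18 12:00:05', '2023-06-18 12:01:00');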

Schema

There are 3 tables (a rough DDL sketch follows the list):

  1. ci_finished_builds - essentially the raw data; we can calculate metrics on it directly, but that may not be optimal
  2. ci_finished_builds_max_finished_at - stores the maximum value of the (finished_at, id) tuple for easy access during sync
  3. ci_finished_builds_by_runner_type - an aggregating materialized view storing pre-calculated percentile states
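
For illustration, the DDL could look roughly like this (a minimal sketch: the column lists, types, engines, and ORDER BY keys are assumptions, not the exact schema from this MR):

-- Raw finished builds, synced from Postgres.
CREATE TABLE ci_finished_builds
(
    id          UInt64,
    status      LowCardinality(String),
    runner_type UInt8,
    queued_at   DateTime64(6, 'UTC'),
    started_at  DateTime64(6, 'UTC'),
    finished_at DateTime64(6, 'UTC')
)
ENGINE = MergeTree
ORDER BY (status, runner_type, started_at, id);

-- Single-row "high-water mark" table: ORDER BY tuple() gives every row the
-- same key, so ReplacingMergeTree keeps only the latest (finished_at, id).
CREATE TABLE ci_finished_builds_max_finished_at
(
    finished_at DateTime64(6, 'UTC'),
    id          UInt64
)
ENGINE = ReplacingMergeTree
ORDER BY tuple();

-- Pre-aggregated wait-time percentile states per 5-minute bucket.
CREATE MATERIALIZED VIEW ci_finished_builds_by_runner_type
ENGINE = AggregatingMergeTree
ORDER BY (status, runner_type, started_at_bucket)
AS SELECT
    status,
    runner_type,
    toStartOfInterval(started_at, INTERVAL 5 MINUTE) AS started_at_bucket,
    countState() AS count_builds,
    quantileState(0.9)(age('second', queued_at, started_at)) AS wait_time_90_percentile,
    quantileState(0.5)(age('second', queued_at, started_at)) AS wait_time_50_percentile,
    quantileState(0.25)(age('second', queued_at, started_at)) AS wait_time_25_percentile
FROM ci_finished_builds
GROUP BY status, runner_type, started_at_bucket;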

Queries:

-- direct aggregate
SELECT   toStartOfInterval(started_at, INTERVAL 5 MINUTE) AS started_at_bucket,
         count(*),
         quantile(0.9)(age('second', queued_at, started_at)),
         quantile(0.5)(age('second', queued_at, started_at)),
         quantile(0.25)(age('second', queued_at, started_at))
FROM     ci_finished_builds
WHERE    status = 'success'
AND      runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC
LIMIT    1000;

-- 1000 rows in set. Elapsed: 0.393 sec. Processed 16.54 million rows, 297.79 MB (42.05 million rows/s., 756.98 MB/s.)

-- aggregated on the materialized view
SELECT   started_at_bucket,
         countMerge(count_builds),
         quantileMerge(0.9)(wait_time_90_percentile),
         quantileMerge(0.5)(wait_time_50_percentile),
         quantileMerge(0.25)(wait_time_25_percentile)
FROM     ci_finished_builds_by_runner_type
WHERE    status = 'success'
AND      runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC
LIMIT    1000;

-- 1000 rows in set. Elapsed: 0.784 sec. Processed 7.79 thousand rows, 3.16 MB (9.95 thousand rows/s., 4.04 MB/s.)

Both queries return similar data, for example:

┌───started_at_bucket─┬─countMerge(count_builds)─┬─quantileMerge(0.9)(wait_time_90_percentile)─┬─quantileMerge(0.5)(wait_time_50_percentile)─┬─quantileMerge(0.25)(wait_time_25_percentile)─┐
│ 2023-06-18 12:50:00 │                      129 │                                     31867.2 │                                       21575 │                                        16678 │
│ 2023-06-18 12:45:00 │                      124 │                                     33349.8 │                                       22863 │                                        16601 │
│ 2023-06-18 12:40:00 │                      136 │                                     29959.5 │                                       22481 │                                        15853 │
│ 2023-06-18 12:35:00 │                      129 │                                     33766.8 │                                       20995 │                                        15818 │
│ 2023-06-18 12:30:00 │                      107 │                                     32567.4 │                                       20760 │                                        15981 │
│ 2023-06-18 12:25:00 │                      132 │                          32304.500000000004 │                                       21357 │                                      15688.5 │
│ 2023-06-18 12:20:00 │                      122 │                          32344.000000000004 │                                     20804.5 │                                     15828.25 │
│ 2023-06-18 12:15:00 │                      128 │                                     33010.6 │                                       21306 │                                        15656 │
│ 2023-06-18 12:10:00 │                      133 │                                     31407.2 │                                       21868 │                                        15779 │

As you can see, the second query reads a significantly smaller amount of data, but it doesn't run any faster.

I have a few hypotheses for why the time doesn't improve:

  1. every bucket only contains about 100 rows, so it may be cheaper for ClickHouse to calculate the quantiles directly than to merge the more complex quantileState structures.
  2. 1M rows is nowhere near enough data to test performance, and the time I'm getting is basically just fixed overhead (query parsing, etc.). Two quick probes for these hypotheses are sketched below.
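
These probe queries are a sketch; neither is part of this MR:

-- Hypothesis 1: confirm how few builds each bucket actually aggregates.
SELECT started_at_bucket, countMerge(count_builds) AS builds_in_bucket
FROM ci_finished_builds_by_runner_type
WHERE status = 'success' AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY builds_in_bucket DESC
LIMIT 10;

-- Hypothesis 2: a near-empty query approximates the fixed per-query
-- overhead (parsing, planning, merging partial results).
SELECT count() FROM ci_finished_builds WHERE id = 0;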

Screenshots or screen recordings

How to set up and validate locally

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

