POC for CI analytics on ClickHouse
What does this MR do and why?
Here's a video explaining the change: https://www.loom.com/share/3cf98980e8334f14a4740836989f4ff6?sid=d80c1c6d-c636-4236-aad9-b1d75785b910
This MR is a proof of concept of using ClickHouse for running analytics queries on the GitLab CI database.
It is focused on implementing the "Wait time to pick up a job" metric from the runner dashboard: #390921
Such queries are slow in Postgres even with indexes and a specialized table that holds only a small portion of builds: !110137 (comment 1298955647)
This MR adds:
- a few tables to the ClickHouse schema
- a service to sync finished builds from Postgres to ClickHouse
- a few queries with comparable performance (see below)
Data ingestion strategy
The rough idea is:
- only sync **finished** builds (ClickHouse isn't optimized for updating data)
- run a special background worker every X amount of time: it collects new records and pushes them to ClickHouse in batches (a sketch of one iteration follows this list)
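A minimal sketch of what a single sync iteration could look like, assuming the table names from the schema below; the column list is illustrative and the Postgres read side is elided:

```sql
-- 1. Read the sync cursor: the newest (finished_at, id) tuple already in ClickHouse.
SELECT max((finished_at, id)) FROM ci_finished_builds_max_finished_at;

-- 2. On the Postgres side (not shown), fetch finished builds newer than that
--    tuple, then push them to ClickHouse as a single batched insert:
INSERT INTO ci_finished_builds
    (id, status, runner_type, queued_at, started_at, finished_at)
VALUES
    (1001, 'success', 0, '2023-06-18 12:00:00', '2023-06-18 12:00:21', '2023-06-18 12:05:42');
```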
Schema
There are 3 tables:
- `ci_finished_builds` - basically the raw data; we can calculate metrics on it directly, but that may not be optimal
- `ci_finished_builds_max_finished_at` - stores the maximum value of tuple(finished_at, id) for easy access during sync
- `ci_finished_builds_by_runner_type` - an aggregated materialized view storing pre-calculated percentiles
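For illustration, the schema could look roughly like this; this is a hedged sketch, not the exact migrations in this MR (engines, ordering keys, and column types are assumptions):

```sql
-- A sketch, not the exact DDL from this MR.
CREATE TABLE ci_finished_builds
(
    id UInt64,
    status LowCardinality(String),
    runner_type UInt8,
    queued_at DateTime64(6),
    started_at DateTime64(6),
    finished_at DateTime64(6)
)
ENGINE = MergeTree
ORDER BY (status, runner_type, started_at, id);

-- Pre-aggregates wait-time percentiles per 5-minute bucket; the -State
-- combinator stores partial aggregation states that the dashboard query
-- later combines with countMerge/quantileMerge.
CREATE MATERIALIZED VIEW ci_finished_builds_by_runner_type
ENGINE = AggregatingMergeTree()
ORDER BY (status, runner_type, started_at_bucket)
AS SELECT
    status,
    runner_type,
    toStartOfInterval(started_at, INTERVAL 5 MINUTE) AS started_at_bucket,
    countState() AS count_builds,
    quantileState(0.9)(age('second', queued_at, started_at)) AS wait_time_90_percentile,
    quantileState(0.5)(age('second', queued_at, started_at)) AS wait_time_50_percentile,
    quantileState(0.25)(age('second', queued_at, started_at)) AS wait_time_25_percentile
FROM ci_finished_builds
GROUP BY status, runner_type, started_at_bucket;
```

The -State/-Merge split is what lets the dashboard query scan the ~7.79 thousand pre-aggregated rows instead of 16.54 million raw ones.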
Queries:
```sql
-- direct aggregate
SELECT toStartOfInterval(started_at, INTERVAL 5 MINUTE) AS started_at_bucket,
       count(*),
       quantile(0.9)(age('second', queued_at, started_at)),
       quantile(0.5)(age('second', queued_at, started_at)),
       quantile(0.25)(age('second', queued_at, started_at))
FROM ci_finished_builds
WHERE status = 'success'
  AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC LIMIT 1000;

-- 1000 rows in set. Elapsed: 0.393 sec. Processed 16.54 million rows, 297.79 MB (42.05 million rows/s., 756.98 MB/s.)
```
```sql
-- aggregated on the materialized view
SELECT started_at_bucket,
       countMerge(count_builds),
       quantileMerge(0.9)(wait_time_90_percentile),
       quantileMerge(0.5)(wait_time_50_percentile),
       quantileMerge(0.25)(wait_time_25_percentile)
FROM ci_finished_builds_by_runner_type
WHERE status = 'success'
  AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC LIMIT 1000;

-- 1000 rows in set. Elapsed: 0.784 sec. Processed 7.79 thousand rows, 3.16 MB (9.95 thousand rows/s., 4.04 MB/s.)
```
Both queries produce similar data, for example:
```
┌───started_at_bucket─┬─countMerge(count_builds)─┬─quantileMerge(0.9)(wait_time_90_percentile)─┬─quantileMerge(0.5)(wait_time_50_percentile)─┬─quantileMerge(0.25)(wait_time_25_percentile)─┐
│ 2023-06-18 12:50:00 │ 129 │ 31867.2            │ 21575   │ 16678    │
│ 2023-06-18 12:45:00 │ 124 │ 33349.8            │ 22863   │ 16601    │
│ 2023-06-18 12:40:00 │ 136 │ 29959.5            │ 22481   │ 15853    │
│ 2023-06-18 12:35:00 │ 129 │ 33766.8            │ 20995   │ 15818    │
│ 2023-06-18 12:30:00 │ 107 │ 32567.4            │ 20760   │ 15981    │
│ 2023-06-18 12:25:00 │ 132 │ 32304.500000000004 │ 21357   │ 15688.5  │
│ 2023-06-18 12:20:00 │ 122 │ 32344.000000000004 │ 20804.5 │ 15828.25 │
│ 2023-06-18 12:15:00 │ 128 │ 33010.6            │ 21306   │ 15656    │
│ 2023-06-18 12:10:00 │ 133 │ 31407.2            │ 21868   │ 15779    │
```
As you can see, the second query reads a significantly smaller amount of data, yet it doesn't run any faster.
I have a few hypotheses as to why the time doesn't improve much:
- every **bucket** only has about 100 rows, and maybe it's easier for ClickHouse to just calculate the quantiles directly than to use the complex `quantileState` structure
- 1M rows is nowhere near enough to test performance, and the time I'm getting is basically just overhead from parsing the query, etc.
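If the second hypothesis is right, inflating the table with synthetic rows and re-running both queries should show the materialized view pulling ahead. A hypothetical way to generate test data (the column values and distributions are arbitrary):

```sql
-- Hypothetical: bulk-generate 100M synthetic finished builds for benchmarking.
-- rand(N) takes a dummy argument only to defeat common subexpression elimination,
-- so each column gets an independent random value.
INSERT INTO ci_finished_builds
SELECT
    number AS id,
    'success' AS status,
    0 AS runner_type,
    subtractSeconds(now(), rand(1) % 864000) AS queued_at,  -- some point in the last 10 days
    addSeconds(queued_at, rand(2) % 120) AS started_at,     -- up to 2 minutes of wait time
    addSeconds(started_at, rand(3) % 3600) AS finished_at   -- up to 1 hour of run time
FROM numbers(100000000);
```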
Screenshots or screen recordings
How to set up and validate locally
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [ ] I have evaluated the MR acceptance checklist for this MR.