POC for CI analytics on ClickHouse
Here's an explanatory video: https://www.loom.com/share/3cf98980e8334f14a4740836989f4ff6?sid=d80c1c6d-c636-4236-aad9-b1d75785b910
This MR is a proof of concept of using ClickHouse to run analytics queries on the GitLab CI database.
It focuses on implementing "Wait time to pick up a job" from the runner dashboard: #390921
Such queries are slow in Postgres even with indexes and a specialized table that holds only a small portion of builds: !110137 (comment 1298955647)
This MR adds the three ClickHouse tables described below, plus the queries to exercise them. The rough idea is:

- We only sync **finished** builds (ClickHouse isn't optimized for updating data).
- There are 3 tables:
  - `ci_finished_builds` - basically raw data; we can calculate metrics on it directly, but it may not be optimal
  - `ci_finished_builds_max_finished_at` - stores the maximum value of the tuple `(finished_at, id)` for easy access during sync
  - `ci_finished_builds_by_runner_type` - an aggregate materialized view storing pre-calculated percentiles
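For illustration, here's a minimal sketch of what the two plain tables might look like. The column names are taken from the queries below; the types, engines, and ordering keys are my assumptions, not the MR's exact DDL:

```sql
-- Raw copy of finished builds (sketch; engine and ORDER BY are assumptions).
CREATE TABLE ci_finished_builds
(
    id          UInt64,
    status      LowCardinality(String),
    runner_type UInt8,
    queued_at   DateTime64(6, 'UTC'),
    started_at  DateTime64(6, 'UTC'),
    finished_at DateTime64(6, 'UTC')
)
ENGINE = MergeTree
ORDER BY (status, runner_type, started_at);

-- Sync cursor (sketch): only needs to answer "what is the newest
-- (finished_at, id) tuple we have already copied from Postgres?".
CREATE TABLE ci_finished_builds_max_finished_at
(
    finished_at DateTime64(6, 'UTC'),
    id          UInt64
)
ENGINE = MergeTree
ORDER BY (finished_at, id);
```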
```sql
-- direct aggregate
SELECT toStartOfInterval(started_at, interval 5 minute) AS started_at_bucket,
       count(*),
       quantile(0.9)(age('second', queued_at, started_at)),
       quantile(0.5)(age('second', queued_at, started_at)),
       quantile(0.25)(age('second', queued_at, started_at))
FROM ci_finished_builds
WHERE status = 'success'
  AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC
LIMIT 1000;

-- 1000 rows in set. Elapsed: 0.393 sec. Processed 16.54 million rows, 297.79 MB (42.05 million rows/s., 756.98 MB/s.)
```
```sql
-- aggregated on materialized view
SELECT started_at_bucket,
       countMerge(count_builds),
       quantileMerge(0.9)(wait_time_90_percentile),
       quantileMerge(0.5)(wait_time_50_percentile),
       quantileMerge(0.25)(wait_time_25_percentile)
FROM ci_finished_builds_by_runner_type
WHERE status = 'success'
  AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC
LIMIT 1000;

-- 1000 rows in set. Elapsed: 0.784 sec. Processed 7.79 thousand rows, 3.16 MB (9.95 thousand rows/s., 4.04 MB/s.)
```
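For context, here's a hedged sketch of the aggregate table and materialized view that a `*Merge` query like the one above would read from. The table and column names match the queries in this MR; the `_mv` view name, types, and engine settings are assumptions:

```sql
-- Aggregate storage (sketch): holds partial aggregation states, not final values.
CREATE TABLE ci_finished_builds_by_runner_type
(
    started_at_bucket       DateTime('UTC'),
    status                  LowCardinality(String),
    runner_type             UInt8,
    count_builds            AggregateFunction(count),
    wait_time_90_percentile AggregateFunction(quantile(0.9), Int64),
    wait_time_50_percentile AggregateFunction(quantile(0.5), Int64),
    wait_time_25_percentile AggregateFunction(quantile(0.25), Int64)
)
ENGINE = AggregatingMergeTree
ORDER BY (status, runner_type, started_at_bucket);

-- The view turns each inserted raw row into aggregation states via the
-- -State combinators; the query above finalizes them with -Merge.
CREATE MATERIALIZED VIEW ci_finished_builds_by_runner_type_mv
TO ci_finished_builds_by_runner_type AS
SELECT toStartOfInterval(started_at, interval 5 minute) AS started_at_bucket,
       status,
       runner_type,
       countState() AS count_builds,
       -- toInt64 pins the argument type to match the AggregateFunction columns
       quantileState(0.9)(toInt64(age('second', queued_at, started_at))) AS wait_time_90_percentile,
       quantileState(0.5)(toInt64(age('second', queued_at, started_at))) AS wait_time_50_percentile,
       quantileState(0.25)(toInt64(age('second', queued_at, started_at))) AS wait_time_25_percentile
FROM ci_finished_builds
GROUP BY started_at_bucket, status, runner_type;
```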
Both queries generate similar data, for example:
```
┌───started_at_bucket─┬─countMerge(count_builds)─┬─quantileMerge(0.9)(wait_time_90_percentile)─┬─quantileMerge(0.5)(wait_time_50_percentile)─┬─quantileMerge(0.25)(wait_time_25_percentile)─┐
│ 2023-06-18 12:50:00 │ 129 │ 31867.2 │ 21575 │ 16678 │
│ 2023-06-18 12:45:00 │ 124 │ 33349.8 │ 22863 │ 16601 │
│ 2023-06-18 12:40:00 │ 136 │ 29959.5 │ 22481 │ 15853 │
│ 2023-06-18 12:35:00 │ 129 │ 33766.8 │ 20995 │ 15818 │
│ 2023-06-18 12:30:00 │ 107 │ 32567.4 │ 20760 │ 15981 │
│ 2023-06-18 12:25:00 │ 132 │ 32304.500000000004 │ 21357 │ 15688.5 │
│ 2023-06-18 12:20:00 │ 122 │ 32344.000000000004 │ 20804.5 │ 15828.25 │
│ 2023-06-18 12:15:00 │ 128 │ 33010.6 │ 21306 │ 15656 │
│ 2023-06-18 12:10:00 │ 133 │ 31407.2 │ 21868 │ 15779 │
```
As you can see, the second query reads a significantly smaller amount of data, but the elapsed time doesn't improve (in this run it was actually slower).
I have a few hypotheses for why the time doesn't improve much:

- Each bucket only has about 100 rows, so maybe it's easier for ClickHouse to just calculate the quantiles directly than to work with the complex `quantileState` structure.
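One cheap way to check the first hypothesis is to look at how many raw rows each bucket actually holds; this just reuses the filter and bucketing from the queries above:

```sql
-- How many raw rows does each 5-minute bucket contain?
SELECT toStartOfInterval(started_at, interval 5 minute) AS started_at_bucket,
       count(*) AS rows_in_bucket
FROM ci_finished_builds
WHERE status = 'success'
  AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC
LIMIT 10;
```

If the per-bucket counts stay around 100 rows, merging pre-built quantile states has little work to save, which would be consistent with the timings above.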