POC for CI analytics on ClickHouse
Here's an explanatory video: https://www.loom.com/share/3cf98980e8334f14a4740836989f4ff6?sid=d80c1c6d-c636-4236-aad9-b1d75785b910
This MR is a proof of concept of using ClickHouse to run analytics queries on the GitLab CI database.
It focuses on implementing "Wait time to pick up a job" from the runner dashboard: #390921
Such queries are slow in Postgres even with indexes and a specialized table that holds only a small portion of builds: !110137 (comment 1298955647)
This MR adds the three ClickHouse tables described below, plus the queries to exercise them. The rough idea is:

- We only sync **finished** builds (ClickHouse isn't optimized for updating data).
- There are 3 tables:
  - `ci_finished_builds` - basically raw data; we can calculate metrics on it directly, but it may not be optimal
  - `ci_finished_builds_max_finished_at` - stores the maximum value of the tuple `(finished_at, id)` for easy access during sync
  - `ci_finished_builds_by_runner_type` - an aggregate materialized view storing pre-calculated percentiles
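For illustration, here's a minimal sketch of what the two plain tables might look like. The column names are taken from the queries below; the types, engines, and ordering keys are my assumptions, not the MR's exact DDL:

```sql
-- Raw copy of finished builds (sketch; engine and ORDER BY are assumptions).
CREATE TABLE ci_finished_builds
(
    id          UInt64,
    status      LowCardinality(String),
    runner_type UInt8,
    queued_at   DateTime64(6, 'UTC'),
    started_at  DateTime64(6, 'UTC'),
    finished_at DateTime64(6, 'UTC')
)
ENGINE = MergeTree
ORDER BY (status, runner_type, started_at);

-- Sync cursor (sketch): only needs to answer "what is the newest
-- (finished_at, id) tuple we have already copied from Postgres?".
CREATE TABLE ci_finished_builds_max_finished_at
(
    finished_at DateTime64(6, 'UTC'),
    id          UInt64
)
ENGINE = MergeTree
ORDER BY (finished_at, id);
```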
```sql
-- direct aggregate
SELECT toStartOfInterval(started_at, interval 5 minute) AS started_at_bucket,
       count(*),
       quantile(0.9)(age('second', queued_at, started_at)),
       quantile(0.5)(age('second', queued_at, started_at)),
       quantile(0.25)(age('second', queued_at, started_at))
FROM ci_finished_builds
WHERE status = 'success'
  AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC
LIMIT 1000;

-- 1000 rows in set. Elapsed: 0.393 sec. Processed 16.54 million rows, 297.79 MB (42.05 million rows/s., 756.98 MB/s.)
```
```sql
-- aggregated on materialized view
SELECT started_at_bucket,
       countMerge(count_builds),
       quantileMerge(0.9)(wait_time_90_percentile),
       quantileMerge(0.5)(wait_time_50_percentile),
       quantileMerge(0.25)(wait_time_25_percentile)
FROM ci_finished_builds_by_runner_type
WHERE status = 'success'
  AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC
LIMIT 1000;

-- 1000 rows in set. Elapsed: 0.784 sec. Processed 7.79 thousand rows, 3.16 MB (9.95 thousand rows/s., 4.04 MB/s.)
```
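For context, here's a hedged sketch of the aggregate table and materialized view that a `*Merge` query like the one above would read from. The table and column names match the queries in this MR; the `_mv` view name, types, and engine settings are assumptions:

```sql
-- Aggregate storage (sketch): holds partial aggregation states, not final values.
CREATE TABLE ci_finished_builds_by_runner_type
(
    started_at_bucket       DateTime('UTC'),
    status                  LowCardinality(String),
    runner_type             UInt8,
    count_builds            AggregateFunction(count),
    wait_time_90_percentile AggregateFunction(quantile(0.9), Int64),
    wait_time_50_percentile AggregateFunction(quantile(0.5), Int64),
    wait_time_25_percentile AggregateFunction(quantile(0.25), Int64)
)
ENGINE = AggregatingMergeTree
ORDER BY (status, runner_type, started_at_bucket);

-- The view turns each inserted raw row into aggregation states via the
-- -State combinators; the query above finalizes them with -Merge.
CREATE MATERIALIZED VIEW ci_finished_builds_by_runner_type_mv
TO ci_finished_builds_by_runner_type AS
SELECT toStartOfInterval(started_at, interval 5 minute) AS started_at_bucket,
       status,
       runner_type,
       countState() AS count_builds,
       -- toInt64 pins the argument type to match the AggregateFunction columns
       quantileState(0.9)(toInt64(age('second', queued_at, started_at))) AS wait_time_90_percentile,
       quantileState(0.5)(toInt64(age('second', queued_at, started_at))) AS wait_time_50_percentile,
       quantileState(0.25)(toInt64(age('second', queued_at, started_at))) AS wait_time_25_percentile
FROM ci_finished_builds
GROUP BY started_at_bucket, status, runner_type;
```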
Both queries generate similar data, for example:
```
┌───started_at_bucket─┬─countMerge(count_builds)─┬─quantileMerge(0.9)(wait_time_90_percentile)─┬─quantileMerge(0.5)(wait_time_50_percentile)─┬─quantileMerge(0.25)(wait_time_25_percentile)─┐
│ 2023-06-18 12:50:00 │ 129 │ 31867.2 │ 21575 │ 16678 │
│ 2023-06-18 12:45:00 │ 124 │ 33349.8 │ 22863 │ 16601 │
│ 2023-06-18 12:40:00 │ 136 │ 29959.5 │ 22481 │ 15853 │
│ 2023-06-18 12:35:00 │ 129 │ 33766.8 │ 20995 │ 15818 │
│ 2023-06-18 12:30:00 │ 107 │ 32567.4 │ 20760 │ 15981 │
│ 2023-06-18 12:25:00 │ 132 │ 32304.500000000004 │ 21357 │ 15688.5 │
│ 2023-06-18 12:20:00 │ 122 │ 32344.000000000004 │ 20804.5 │ 15828.25 │
│ 2023-06-18 12:15:00 │ 128 │ 33010.6 │ 21306 │ 15656 │
│ 2023-06-18 12:10:00 │ 133 │ 31407.2 │ 21868 │ 15779 │
```
As you can see, the second query reads a significantly smaller amount of data, but the elapsed time doesn't improve (in this run it was actually slower).
I have a few hypotheses for why the time doesn't improve much:

- Each bucket only has about 100 rows, so maybe it's easier for ClickHouse to just calculate the quantiles directly than to work with the complex `quantileState` structure.
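One cheap way to check the first hypothesis is to look at how many raw rows each bucket actually holds; this just reuses the filter and bucketing from the queries above:

```sql
-- How many raw rows does each 5-minute bucket contain?
SELECT toStartOfInterval(started_at, interval 5 minute) AS started_at_bucket,
       count(*) AS rows_in_bucket
FROM ci_finished_builds
WHERE status = 'success'
  AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC
LIMIT 10;
```

If the per-bucket counts stay around 100 rows, merging pre-built quantile states has little work to save, which would be consistent with the timings above.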