PoC for CI analytics on ClickHouse
What does this MR do and why?
Here's the explanatory video: https://www.loom.com/share/3cf98980e8334f14a4740836989f4ff6?sid=d80c1c6d-c636-4236-aad9-b1d75785b910
This MR is a proof of concept of using ClickHouse to run analytics queries against the GitLab CI database.
It focuses on implementing the "Wait time to pick up a job" metric from the runner dashboard: #390921
Such queries are slow in Postgres even with indexes and a specialized table that holds only a small portion of builds: !110137 (comment 1298955647)
This MR adds:
- a few tables to the ClickHouse schema
- a service to sync finished builds from Postgres to ClickHouse
- a few queries with comparable performance (see below)
Data ingestion strategy
The rough idea is:
- only sync finished builds (ClickHouse isn't optimized for updating data)
- run a dedicated background worker at a fixed interval: it collects new records and pushes them to ClickHouse in batches (the shape of such an insert is sketched below)
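As a rough illustration, here is the shape of the batched insert such a worker would issue (a minimal sketch; the column list follows the ci_finished_builds schema described below, and the values are placeholders):

```sql
-- Sketch only: a batched INSERT the background worker could send every few minutes.
-- Columns follow the ci_finished_builds schema below; values are made-up placeholders.
INSERT INTO ci_finished_builds
    (id, project_id, pipeline_id, status, queued_at, started_at, finished_at, runner_id, runner_type)
VALUES
    (1001, 1, 501, 'success', now64(6) - INTERVAL 10 MINUTE, now64(6) - INTERVAL 9 MINUTE, now64(6), 42, 0),
    (1002, 1, 502, 'failed',  now64(6) - INTERVAL 8 MINUTE,  now64(6) - INTERVAL 7 MINUTE, now64(6), 43, 1);
```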
Schema
There are 3 tables:
- ci_finished_builds - essentially the raw data; we can calculate metrics on it directly, but it may not be optimal
- ci_finished_builds_max_finished_at - stores the maximum value of tuple(finished_at, id) for easy access during sync (a lookup sketch follows below the list)
- ci_finished_builds_by_runner_type - an aggregating materialized view storing pre-calculated percentiles
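To make the sync incremental, the service needs the last synced (finished_at, id) value. A hedged sketch of how that lookup could be done against ci_finished_builds_max_finished_at (assuming max_finished_at_id is a max aggregate over tuple(finished_at, id), as in the migration quoted further down):

```sql
-- Sketch: fetch the sync cursor from the small aggregate table.
-- Assumes max_finished_at_id holds a max over tuple(finished_at, id).
SELECT
    tupleElement(max(max_finished_at_id), 1) AS last_finished_at,
    tupleElement(max(max_finished_at_id), 2) AS last_id
FROM ci_finished_builds_max_finished_at;
```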
Queries:
-- direct aggregate
SELECT toStartOfInterval(started_at, interval 5 minute) AS started_at_bucket,
count(*),
quantile(0.9)(age('second', queued_at, started_at)),
quantile(0.5)(age('second', queued_at, started_at)),
quantile(0.25)(age('second', queued_at, started_at))
FROM ci_finished_builds
WHERE status = 'success'
AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC limit 1000;
-- 1000 rows in set. Elapsed: 0.393 sec. Processed 16.54 million rows, 297.79 MB (42.05 million rows/s., 756.98 MB/s.)
-- aggregated on the materialized view
SELECT started_at_bucket,
countMerge(count_builds),
quantileMerge(0.9)(wait_time_90_percentile),
quantileMerge(0.50)(wait_time_50_percentile),
quantileMerge(0.25)(wait_time_25_percentile)
FROM ci_finished_builds_by_runner_type
WHERE status = 'success'
AND runner_type = 0
GROUP BY started_at_bucket
ORDER BY started_at_bucket DESC limit 1000;
-- 1000 rows in set. Elapsed: 0.784 sec. Processed 7.79 thousand rows, 3.16 MB (9.95 thousand rows/s., 4.04 MB/s.)
Both queries produce similar data, for example:
┌───started_at_bucket─┬─countMerge(count_builds)─┬─quantileMerge(0.9)(wait_time_90_percentile)─┬─quantileMerge(0.5)(wait_time_50_percentile)─┬─quantileMerge(0.25)(wait_time_25_percentile)─┐
│ 2023-06-18 12:50:00 │ 129 │ 31867.2 │ 21575 │ 16678 │
│ 2023-06-18 12:45:00 │ 124 │ 33349.8 │ 22863 │ 16601 │
│ 2023-06-18 12:40:00 │ 136 │ 29959.5 │ 22481 │ 15853 │
│ 2023-06-18 12:35:00 │ 129 │ 33766.8 │ 20995 │ 15818 │
│ 2023-06-18 12:30:00 │ 107 │ 32567.4 │ 20760 │ 15981 │
│ 2023-06-18 12:25:00 │ 132 │ 32304.500000000004 │ 21357 │ 15688.5 │
│ 2023-06-18 12:20:00 │ 122 │ 32344.000000000004 │ 20804.5 │ 15828.25 │
│ 2023-06-18 12:15:00 │ 128 │ 33010.6 │ 21306 │ 15656 │
│ 2023-06-18 12:10:00 │ 133 │ 31407.2 │ 21868 │ 15779 │
As you can see, the second query reads a significantly smaller amount of data, but takes almost exactly the same amount of time.
I have a few hypotheses for why the time doesn't improve much:
- every bucket only has about 100 rows, so it may be easier for ClickHouse to calculate the quantiles directly than to merge the more complex quantileState structures
- 1M rows is nowhere near enough data to test performance, and the time I'm getting is basically just overhead from parsing the query, etc. (one way to check the stored data volume is sketched below)
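One way to check the second hypothesis is to compare how much data each table actually stores, e.g. via the standard system.parts table (a sketch):

```sql
-- Sketch: stored row counts and on-disk size of the raw table vs. the materialized views.
SELECT
    table,
    sum(rows)                              AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS disk_size
FROM system.parts
WHERE active AND table LIKE 'ci_finished_builds%'
GROUP BY table;
```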
Merge request reports
Activity
assigned to @vshushlin
The following files require a review from the Database team:
db/click_house/main/20230707151359_create_ci_finished_builds.sql
db/click_house/main/20230719101203_create_ci_finished_builds_max_finished_at.sql
db/click_house/main/20230719101806_create_ci_builds_by_runner_type.sql
added 1922 commits
- 465ec73a...ddedadc8 - 1921 commits from branch master
- ef7de932 - WIP
@ahegyi @pedropombeiro @grzesiek can you take a look at this PoC?
I recorded a video with some comments (linked at the very top).
Great job with this PoC so far, @vshushlin!
I left just one question to help me understand the table creation.
requested review from @ahegyi, @pedropombeiro, and @grzesiek
mentioned in merge request !123356 (merged)
# frozen_string_literal: true

module ClickHouse
  module DataIngestion
    class SyncCiFinishedBuildsService
      # TODO: use exclusive lease or something to avoid concurrent execution
      # TODO: limit number of records
      def execute
        builds_finished_since_last_sync
          .each_batch(column: :finished_at, order_hint: :id) do |builds_batch|

For the iteration, I can suggest this guide: https://docs.gitlab.com/ee/development/database/iterating_tables_in_batches.html#iterate-with-keyset-pagination
          insert_builds(builds_batch.includes(:runner, :runner_manager)) # rubocop: disable CodeReuse/ActiveRecord

      BUILD_FIELDS = [:id, :project_id, :pipeline_id, :status, :finished_at,
                      :created_at, :started_at, :queued_at, :runner_id,
                      :runner_manager_system_xid].freeze
      RUNNER_FIELDS = [:run_untagged, :type].freeze
      RUNNER_MANAGER_FIELDS = [:version, :revision, :platform, :architecture].freeze

      def builds_finished_since_last_sync
        finished_at, id = last_record_finished_at_and_id

        return Ci::Build unless finished_at

        Ci::Build.finished_after(finished_at, id)
      end

      def insert_builds(builds)

I'm hoping to get some tooling implemented that makes batch inserts memory efficient using a tmp csv file + compression: #414937 (comment 1461641827)
I wonder how inefficient that would be if I loaded, let's say, 10k records in a single batch?
Or we could use a similar approach to #414937 (comment 1461641827): load smaller batches and concatenate the values.
Temporarily writing 10K rows to disk should be "cheaper" than keeping 10K rows in memory until a batch is done (GC might actually free up memory while a batch is written to disk, since there is no more reference to those strings). Uploading a CSV with 10K rows can be done with a streaming approach, using the body_stream option for HTTParty. Anyway, this step should be abstracted away in the client itself.
-- source table for CI analytics, almost useless on its own, but it's a basis for creating materialized views
CREATE TABLE ci_finished_builds
(
    id UInt64 DEFAULT 0,
    project_id UInt32 DEFAULT 0,
    pipeline_id UInt32 DEFAULT 0, -- this is called commit_id in the main app
    status LowCardinality(String) DEFAULT '',

    --- Fields to calculate timings
    finished_at DateTime64(6, 'UTC') DEFAULT now(),
    created_at DateTime64(6, 'UTC') DEFAULT now(),
    started_at DateTime64(6, 'UTC') DEFAULT now(),
    queued_at DateTime64(6, 'UTC') DEFAULT now(),
    runner_id UInt32 DEFAULT 0,
    runner_manager_system_xid String DEFAULT '',

    --- Runner fields
    runner_run_untagged Boolean DEFAULT FALSE,
    runner_type UInt8 DEFAULT 0,
    runner_manager_version LowCardinality(String) DEFAULT '',
    runner_manager_revision LowCardinality(String) DEFAULT '',
    runner_manager_platform LowCardinality(String) DEFAULT '',
    runner_manager_architecture LowCardinality(String) DEFAULT ''
)
ENGINE = MergeTree

I'd suggest defining age as a materialized column: https://clickhouse.com/docs/en/sql-reference/statements/create/table#materialized
On-the-fly calculated expressions in aggregations and sorting will be slow since CH will need to calculate the age for all rows.
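For reference, a minimal sketch of that suggestion (the column name wait_time_seconds is illustrative):

```sql
-- Sketch: a materialized column is computed at insert time and stored,
-- so aggregations don't recompute age() for every row.
ALTER TABLE ci_finished_builds
    ADD COLUMN wait_time_seconds Int64 MATERIALIZED age('second', queued_at, started_at);
```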
-- This table is needed to quickly calculate max(finished_at, id) on ci_finished_builds for data ingestion
CREATE MATERIALIZED VIEW ci_finished_builds_max_finished_at
ENGINE = AggregatingMergeTree()
ORDER BY (max_finished_at_id)
POPULATE AS
SELECT
    maxSimpleState(tuple(finished_at, id)) as max_finished_at_id

CREATE MATERIALIZED VIEW ci_finished_builds_by_runner_type

The general advice for defining MVs:
- Create an empty table that mimics your MV's schema.
- Create an MV pointing to the empty table (a sketch of this pattern follows below).
See an example in the description: #414610 (closed)
AFAIK this will make it easier to alter the underlying DB structure (I haven't tested it yet).
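A hedged sketch of that pattern for ci_finished_builds_by_runner_type (reduced to a single percentile column; the target table name and the column types are assumptions, since the full view definition isn't quoted here):

```sql
-- Sketch: explicit target table + MV writing into it, instead of an implicit .inner table.
CREATE TABLE ci_finished_builds_by_runner_type_data
(
    started_at_bucket       DateTime64(6, 'UTC'),
    runner_type             UInt8,
    status                  LowCardinality(String),
    count_builds            AggregateFunction(count),
    wait_time_90_percentile AggregateFunction(quantile(0.9), Int64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (status, runner_type, started_at_bucket);

CREATE MATERIALIZED VIEW ci_finished_builds_by_runner_type
TO ci_finished_builds_by_runner_type_data
AS SELECT
    toStartOfInterval(started_at, INTERVAL 5 minute) AS started_at_bucket,
    runner_type,
    status,
    countState() AS count_builds,
    -- cast so the state type matches the Int64 column above
    quantileState(0.9)(toInt64(age('second', queued_at, started_at))) AS wait_time_90_percentile
FROM ci_finished_builds
GROUP BY started_at_bucket, runner_type, status;
```

With this layout, the underlying data table can be altered or swapped without recreating the view's implicit storage.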
CREATE MATERIALIZED VIEW ci_finished_builds_by_runner_type
ENGINE = AggregatingMergeTree()
ORDER BY (status, runner_type, started_at_bucket)

Changing the sorting might "fix" the performance issue from the video, since you're looking at the last N buckets.
Before measuring the query performance, ensure that the table is optimized:
OPTIMIZE TABLE my_table FINAL;
And that there are no background merge operations running:
SELECT * FROM system.mutations; -- should return 0 rows
Thanks, @vshushlin! I left a few comments.
scope :with_live_trace, -> { where('EXISTS (?)', Ci::BuildTraceChunk.where("#{quoted_table_name}.id = #{Ci::BuildTraceChunk.quoted_table_name}.build_id").select(1)) }
scope :with_stale_live_trace, -> { with_live_trace.finished_before(12.hours.ago) }
scope :finished_before, -> (date) { finished.where('finished_at < ?', date) }

# TODO: add index to facilitate this query (state, finished_at, id)
scope :finished_after, -> (finished_at, id = nil) { finished.where('finished_at > ?', finished_at).or(finished.where('finished_at = ? AND id > ?', finished_at, id)) }

Why do we need to iterate builds instead of sending data to ClickHouse immediately once a build is complete? If there is a need for aggregation, we can build an aggregation data structure somewhere, but do we know that we actually need to do it?
Edited by Grzegorz Bizon
@grzesiek ClickHouse likes to receive the data in (large) batches.
You can read more at https://clickhouse.com/docs/en/optimize/bulk-inserts
It also has built-in "asynchronous inserts", but I'm not sure how reliable they are, given that CH batches the data in memory (a sketch of how they're enabled follows below).
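For context, those built-in asynchronous inserts are opt-in per query or connection, roughly like this (a sketch; the settings are standard ClickHouse, the values and column subset are illustrative):

```sql
-- Sketch: ClickHouse buffers these rows server-side and flushes them as a batch.
-- wait_for_async_insert = 1 makes the client block until the buffered batch is written.
INSERT INTO ci_finished_builds (id, project_id, status)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (1001, 1, 'success');
```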
@vshushlin So, how often do we want to run SyncCiFinishedBuildsService? Please be aware that we may be processing around 0.5 million builds per hour these days. It might be quite expensive to build a large bulk insert payload for 0.5M entries, right? The number can grow quite quickly over the next few years.
Edited by Grzegorz Bizon
The idea was to run it every 5 minutes, and @ahegyi was suggesting to optimize building the payload with a CSV file: #414937 (comment 1461641827)
Now I wonder, though, if it will be possible to keep up with the load in a single thread.
Actually, WDYT about using the already existing partitioning to parallelize things? We could create one "loader" process per partition. (I'm not familiar with our partitioning at all.)
Edited by Vladimir Shushlin
> Now I wonder, though, if it will be possible to keep up with the load in a single thread.
A single thread will go a long way: assuming 1s for loading a 1K batch of builds from PG, and allowing a max runtime of 3 minutes for PG (plus 2 minutes for CH), we could process 150K+ rows per run (1.8M rows/hour).
Idea for parallelization: minute ranges
- Worker 1 looks at builds in the 00-30 minute range.
- Worker 2 looks at builds in the 30-60 minute range.
This might complicate the batching logic, and we might need to track the cursors separately.
The CSV writing and uploading process should also be batched/rotated. For example, once we reach 100K rows, we close the CSV file and invoke the uploader for CH; after that we can open a new CSV file.
> 1.8M rows/hour
@ahegyi As we are currently processing 0.5M per hour, and soon it might be 1M, 1.8M does not seem to offer a lot of headroom.
@grzesiek, there are many options for tweaking the background jobs. Our "luck" is that we can safely offload the read queries to the replicas, which means that we might run more queries per job. I wouldn't complicate the initial PoC with a parallel worker implementation (it needs more thinking).
The 1-second runtime per batch was an estimate based on PG.ai, which we know performs slower than PRD. Reading 1K rows from a random id takes 700ms: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/20980/commands/68663
Nice PoC @vshushlin! I left one tiny question :)
removed review request for @grzesiek
      def insert_builds(builds)
        query = <<~SQL
          INSERT INTO ci_finished_builds (#{column_names.join(',')}) VALUES #{values(builds)}

The query itself can be quite large, maybe somewhere between 100 megabytes and 1 gigabyte, given the number of builds we are processing these days, depending on how often we run these INSERTs.
Edited by Grzegorz Bizon
That's another argument in favor of the CSV approach: #414937 (comment 1461641827)
    --- Fields to calculate timings
    finished_at DateTime64(6, 'UTC') DEFAULT now(),

question: @vshushlin this is for my own understanding, as I'm sure there is a CH-specific reason for this: why do we define defaults for every column? In normal scenarios we can assume that we know the finished_at value, given that we're inserting finished builds.
Nullable has some negative effects on storage and performance: https://clickhouse.com/docs/en/sql-reference/data-types/nullable#storage-features
Ideally, we should validate and ensure that we send correct data to CH so default values are never used.
@ahegyi yeah I've seen that - my question was more regarding making some of the columns non-nullable, if we know that they should always contain a value.
added 2412 commits
- f29a9a81...b66dd559 - 2410 commits from branch master
- 110511ad - WIP
- 2e56b454 - WIP
mentioned in issue #412709 (closed)
mentioned in issue #412714 (closed)
mentioned in issue #412720 (closed)
mentioned in issue #421199 (closed)
mentioned in issue #421200 (closed)
mentioned in issue #421202 (closed)