Incubation: APM ClickHouse evaluation
So far we've tested a basic ClickHouse schema against TimescaleDB using apm/tsbs.
ClickHouse performs very well, beating TimescaleDB in almost every category - see the results for in-depth graphs.
There are many databases to choose from in this domain, including a large number of time-series-specific databases - https://db-engines.com/en/ranking/time+series+dbms - but we ideally want something that can handle a variety of o11y data, not just time series.
ClickHouse is gaining traction in a number of domains. It is horizontally scalable with replication, provides an SQL dialect, and is designed for large analytical queries.
We want to evaluate ClickHouse as a horizontally scalable datastore for o11y data (metrics, logs, traces).
There is some precedent for using ClickHouse in this area:
Articles about benchmarks for ClickHouse
- ClickHouse vs Amazon Redshift for Time Series Data - http://brandonharris.io/redshift-clickhouse-time-series/
- ClickHouse vs Spark/MySQL - https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/
Time Series Benchmarking Suite (TSBS)
We're using tsbs to benchmark ClickHouse schemas against other solutions.
Currently supported databases for tsbs are:
| InfluxDB | No | Yes (Enterprise only) | High | High |
In this iteration we're limiting ourselves to multi-modal databases that we might be able to use for all o11y data, rather than have to support multiple databases.
In this case we will benchmark various ClickHouse schemas against CrateDB, MongoDB and TimescaleDB.
Experimental setup scripts and bug fixes live in https://gitlab.com/gitlab-org/incubation-engineering/apm/tsbs
- Initial TimescaleDB vs ClickHouse comparison with varying DevOps workloads and data sizes.
- Comparison with CrateDB and MongoDB too - see #12 (closed).
- Capture and present CPU, memory and disk storage usage for each database.
Key configuration items:
- `USE_CASE` - selects the time-series data generation and query set; we use "cpu-only" (only CPU data) and "devops" (a larger ops sample with various tables).
- `SCALE` - sets the number of hosts in the data simulation; 100 hosts results in 259,200,000 individual metrics for "cpu-only".
- `FORMATS` - database formats to use for generation and loading.
- `NUM_WORKERS` - number of workers used for parallel loading/querying.
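The `SCALE` figure can be sanity-checked with quick arithmetic. A sketch, assuming tsbs defaults of a 10-second log interval, 10 CPU metrics per reading, and a 30-day simulated window (these defaults are assumptions, not taken from our run configuration):

```python
# Rough sanity check of total metrics generated for "cpu-only" at SCALE=100.
# Assumed (not confirmed from the run config): 10 s log interval,
# 10 CPU metrics per reading, 30 simulated days.
HOSTS = 100
METRICS_PER_READING = 10          # usage_user, usage_system, ... usage_guest_nice
LOG_INTERVAL_S = 10
WINDOW_S = 30 * 24 * 3600         # 30 simulated days

readings_per_host = WINDOW_S // LOG_INTERVAL_S
total_metrics = HOSTS * readings_per_host * METRICS_PER_READING
print(total_metrics)  # 259200000, matching the figure above
```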
TSBS is used to script tests against target databases in the following flow:
- Start cadvisor/prometheus for monitoring DB containers.
- For each format:
- Generate data based on the configuration above.
- Generate queries based on the configuration above.
- Start target database.
- Load data into target DB (metrics files created).
- Execute queries against target DB (metrics files created).
- Stop DB and clean up.
- Acquire container metrics from Prometheus (cpu, memory, storage).
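The generate steps above can be sketched as tsbs invocations. The flag names below come from the tsbs README, but the exact values (seed, query count, query type) are illustrative assumptions rather than the actual script:

```python
# Sketch of the per-format "generate" steps in the flow above.
# Flag names follow the tsbs README; values are illustrative only.
def generate_cmds(use_case: str, scale: int, fmt: str, query_type: str):
    data_cmd = (f"tsbs_generate_data --use-case={use_case} "
                f"--scale={scale} --format={fmt} --seed=123")
    query_cmd = (f"tsbs_generate_queries --use-case={use_case} "
                 f"--scale={scale} --format={fmt} --seed=123 "
                 f"--queries=1000 --query-type={query_type}")
    return [data_cmd, query_cmd]

cmds = generate_cmds("cpu-only", 100, "clickhouse", "cpu-max-all-1")
```

The real runs are driven by the scripts in the apm/tsbs repository linked above.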
Example Data for "devops"
For "cpu-only" the "cpu" and "tags" schemas are the only ones in use. The "tags" schema is:

```
hostname string, region string, datacenter string, rack string, os string,
arch string, team string, service string, service_version string,
service_environment string
```
Example data snippet:

```
tags,hostname=host_0,region=eu-west-1,datacenter=eu-west-1c,rack=87,os=Ubuntu16.04LTS,arch=x64,team=NYC,service=18,service_version=1,service_environment=production
cpu,1451606400000000000,58,2,24,61,22,63,6,44,80,38
tags,hostname=host_1,region=ap-southeast-1,datacenter=ap-southeast-1b,rack=97,os=Ubuntu15.10,arch=x86,team=LON,service=12,service_version=0,service_environment=production
diskio,1451606400000000000,0,0,0,0,0,0,0
...
```
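To make the layout concrete, here is a minimal parser for these serialized lines. The field meanings are inferred from the snippet (a hypothetical helper, not part of tsbs):

```python
# Minimal parser for the tsbs serialized snippet above. Field layout is
# inferred from the example; the real tsbs loaders handle more cases.
def parse_line(line: str):
    kind, _, rest = line.partition(",")
    if kind == "tags":
        # tags lines are key=value pairs describing a host
        return kind, dict(kv.split("=", 1) for kv in rest.split(","))
    # measurement lines are a nanosecond timestamp followed by metric values
    fields = rest.split(",")
    return kind, {"ts_ns": int(fields[0]),
                  "values": [float(v) for v in fields[1:]]}

kind, row = parse_line("cpu,1451606400000000000,58,2,24,61,22,63,6,44,80,38")
```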
Example Queries for "devops"
Example queries and descriptions for those generated by the benchmark.
| Query type | Description | Table |
| --- | --- | --- |
| cpu-max-all-1 | max of all CPU metrics, random host, random 8h segment by 1h | cpu |
| cpu-max-all-8 | max of all CPU metrics, random 8 hosts, random 8h segment by 1h | cpu |
| double-groupby-1 | mean of 1 metric, all hosts, random 12h segment by 1h | cpu |
| double-groupby-5 | mean of 5 metrics, all hosts, random 12h segment by 1h | cpu |
| double-groupby-all | mean of 10 metrics, all hosts, random 12h segment by 1h | cpu |
| groupby-orderby-limit | max cpu over last 5 min-intervals (random end) | cpu |
| high-cpu-1 | CPU over threshold, 1 host | cpu |
| high-cpu-all | CPU over threshold, all hosts | cpu |
| lastpoint | Last row per host | cpu |
| single-groupby-1-1-1 | 1 cpu metric, random host, random 1h group by 1m | cpu |
| single-groupby-1-1-12 | 1 cpu metric, random host, random 12h group by 1m | cpu |
| single-groupby-1-8-1 | 1 cpu metric, random 8 hosts, random 1h group by 1m | cpu |
Examples of each query type for ClickHouse (labelled in the same order as the table above):

```sql
-- cpu-max-all-1
SELECT toStartOfHour(created_at) AS hour, max(usage_user) AS max_usage_user, max(usage_system) AS max_usage_system, max(usage_idle) AS max_usage_idle, max(usage_nice) AS max_usage_nice, max(usage_iowait) AS max_usage_iowait, max(usage_irq) AS max_usage_irq, max(usage_softirq) AS max_usage_softirq, max(usage_steal) AS max_usage_steal, max(usage_guest) AS max_usage_guest, max(usage_guest_nice) AS max_usage_guest_nice FROM cpu WHERE tags_id IN (SELECT id FROM tags WHERE hostname IN ('host_49')) AND (created_at >= '2016-01-01 11:48:31') AND (created_at < '2016-01-01 19:48:31') GROUP BY hour ORDER BY hour

-- cpu-max-all-8
SELECT toStartOfHour(created_at) AS hour, max(usage_user) AS max_usage_user, max(usage_system) AS max_usage_system, max(usage_idle) AS max_usage_idle, max(usage_nice) AS max_usage_nice, max(usage_iowait) AS max_usage_iowait, max(usage_irq) AS max_usage_irq, max(usage_softirq) AS max_usage_softirq, max(usage_steal) AS max_usage_steal, max(usage_guest) AS max_usage_guest, max(usage_guest_nice) AS max_usage_guest_nice FROM cpu WHERE tags_id IN (SELECT id FROM tags WHERE hostname IN ('host_49','host_3','host_35','host_39','host_75','host_15','host_21','host_11')) AND (created_at >= '2016-01-01 11:48:31') AND (created_at < '2016-01-01 19:48:31') GROUP BY hour ORDER BY hour

-- double-groupby-1
SELECT hour, hostname, mean_usage_user FROM ( SELECT toStartOfHour(created_at) AS hour, tags_id AS id, avg(usage_user) AS mean_usage_user FROM cpu WHERE (created_at >= '2016-01-02 11:22:40') AND (created_at < '2016-01-02 23:22:40') GROUP BY hour, id ) AS cpu_avg ANY INNER JOIN tags USING (id) ORDER BY hour ASC, hostname

-- double-groupby-5
SELECT hour, hostname, mean_usage_user, mean_usage_system, mean_usage_idle, mean_usage_nice, mean_usage_iowait FROM ( SELECT toStartOfHour(created_at) AS hour, tags_id AS id, avg(usage_user) AS mean_usage_user, avg(usage_system) AS mean_usage_system, avg(usage_idle) AS mean_usage_idle, avg(usage_nice) AS mean_usage_nice, avg(usage_iowait) AS mean_usage_iowait FROM cpu WHERE (created_at >= '2016-01-02 11:22:40') AND (created_at < '2016-01-02 23:22:40') GROUP BY hour, id ) AS cpu_avg ANY INNER JOIN tags USING (id) ORDER BY hour ASC, hostname

-- double-groupby-all
SELECT hour, hostname, mean_usage_user, mean_usage_system, mean_usage_idle, mean_usage_nice, mean_usage_iowait, mean_usage_irq, mean_usage_softirq, mean_usage_steal, mean_usage_guest, mean_usage_guest_nice FROM ( SELECT toStartOfHour(created_at) AS hour, tags_id AS id, avg(usage_user) AS mean_usage_user, avg(usage_system) AS mean_usage_system, avg(usage_idle) AS mean_usage_idle, avg(usage_nice) AS mean_usage_nice, avg(usage_iowait) AS mean_usage_iowait, avg(usage_irq) AS mean_usage_irq, avg(usage_softirq) AS mean_usage_softirq, avg(usage_steal) AS mean_usage_steal, avg(usage_guest) AS mean_usage_guest, avg(usage_guest_nice) AS mean_usage_guest_nice FROM cpu WHERE (created_at >= '2016-01-02 11:22:40') AND (created_at < '2016-01-02 23:22:40') GROUP BY hour, id ) AS cpu_avg ANY INNER JOIN tags USING (id) ORDER BY hour ASC, hostname

-- groupby-orderby-limit
SELECT toStartOfMinute(created_at) AS minute, max(usage_user) FROM cpu WHERE created_at < '2016-01-01 22:38:24' GROUP BY minute ORDER BY minute DESC LIMIT 5

-- high-cpu-1
SELECT * FROM cpu PREWHERE (usage_user > 90.0) AND (created_at >= '2016-01-01 21:18:55') AND (created_at < '2016-01-02 09:18:55') AND (tags_id IN (SELECT id FROM tags WHERE hostname IN ('host_3')))

-- high-cpu-all
SELECT * FROM cpu PREWHERE (usage_user > 90.0) AND (created_at >= '2016-01-02 23:35:31') AND (created_at < '2016-01-03 11:35:31')

-- lastpoint
SELECT * FROM ( SELECT * FROM cpu WHERE (tags_id, created_at) IN ( SELECT tags_id, max(created_at) FROM cpu GROUP BY tags_id ) ) AS c ANY INNER JOIN tags AS t ON c.tags_id = t.id ORDER BY t.hostname ASC, c.time DESC

-- single-groupby-1-1-1
SELECT toStartOfMinute(created_at) AS minute, max(usage_user) AS max_usage_user FROM cpu WHERE tags_id IN (SELECT id FROM tags WHERE hostname IN ('host_49')) AND (created_at >= '2016-01-03 12:26:46') AND (created_at < '2016-01-03 13:26:46') GROUP BY minute ORDER BY minute ASC
```
Test machine details:
- Google Cloud VM - n2-standard-16 (16 vCPUs, 64 GB memory)
- Intel(R) Xeon(R) CPU, 2800 MHz
- ~~200GB~~ 500GB persistent SSD (200GB wasn't enough space!)
- `Linux jshaw-clickhouse-benchmarking 5.11.0-1017-gcp #19~20.04.1-Ubuntu SMP Thu Aug 12 05:25:25 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux`
Methodology: use a Google Cloud VM with Docker to best replicate a GitLab production environment while keeping the setup simple and repeatable. Benchmarking scripts clean the environment between runs (stop containers, delete volumes, etc.). Four benchmark workers are used, as this matches other benchmarks we've seen and minimizes contention for CPU.
TimescaleDB vs ClickHouse
ClickHouse significantly outperforms TimescaleDB in metric storage rate (therefore time taken is much shorter too).
ClickHouse has a much smaller data volume footprint in all cases by almost a factor of 10.
Query p95 latencies (1000 queries per use-case/query type):
Generally ClickHouse performs better in the majority of query types. For groupby-orderby-limit, ClickHouse performs consistently worse by a significant margin; this could likely be improved with better indexing.
There are very few ClickHouse queries that have >1s latency at q95. TimescaleDB has multiple >1s latencies, including a few in the range of 15-25s.
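For reference, the p95 figures are empirical percentiles over the 1000 recorded latencies per query type. A minimal sketch of the metric's definition (not the actual tsbs reporting code):

```python
# Empirical p95 over a set of recorded query latencies (nearest-rank method:
# the smallest value with at least 95% of samples at or below it).
# A sketch of the metric's definition, not the tsbs reporting code.
def p95(latencies_ms):
    ordered = sorted(latencies_ms)
    rank = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[rank]

print(p95(list(range(1, 101))))  # 95
```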
CPU utilization is fairly consistent between the databases. ClickHouse consistently outperforms TimescaleDB in total memory usage over the tests. With the larger datasets TimescaleDB creeps upwards of 60GiB of memory, which may indicate that the machine size was impacting performance.
- Workers and databases are on the same machine, which is not realistic. However this removes the complexity of network performance from the benchmark.
- 4 workers are used for all benchmarks. It would be interesting to see how the databases perform with more or less workers.
- Database default settings are used. Better performance may be achievable by tuning various parameters.
- Resources were overutilized during the larger runs, which certainly impacted individual DB performance.
- Queries only target the `cpu` table, probably for simplicity, but this isn't realistic.