Draft: Make collector batching configurable in cluster CR
After running some experiments, we've witnessed significant improvements when inserting larger batches into ClickHouse. We've landed on a batching configuration that works well for our current loads.
The goal of this MR is to expose the batching configuration in the Cluster CR so that different values can be used for dev, testing, and production.
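For illustration, this is roughly the shape the setting could take in the Cluster CR spec; the apiVersion, kind, and field names below are hypothetical placeholders, not the actual schema introduced by this MR:

```yaml
# Hypothetical sketch: apiVersion, kind, and field names are placeholders,
# not the schema added by this MR.
apiVersion: opstrace.com/v1alpha1
kind: Cluster
metadata:
  name: observability-dev
spec:
  collector:
    batching:
      sendBatchSize: 25000       # smaller batches for dev/testing
      sendBatchMaxSize: 50000
      timeout: 5s
```

The idea is that dev/testing clusters can keep modest batch sizes while production uses the larger values discussed below.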
requested review from @vespian_gl, @ankitbhatnagar, and @arun.sori
@mappelman any risk with upping send_batch_size by ~ 10x? Would a progressive set of smaller increases be better?
@nicholasklick The risk is very small compared to the potentially huge reward: the only downside is that more spans will be lost if the collector crashes. The collector is extremely stable and has never crashed (confirmed in dashboards); during upgrades it is shut down with correct flushing before a new instance is started.
A binary-search-style approach here will save a lot of time over small incremental increases.
@nicholasklick ClickHouse recommends sending between 10k and "millions" of rows in a single INSERT.
Here is a great 3-part series on the matter: https://clickhouse.com/blog/supercharge-your-clickhouse-data-loads-part2
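For reference, these settings belong to the OpenTelemetry Collector batch processor. A minimal sketch of the pre-change values with comments on what each knob controls (the surrounding pipeline wiring is assumed, not copied from the actual config):

```yaml
processors:
  batch:
    # Number of items (spans, data points, log records) at which the batch
    # is flushed regardless of timeout.
    send_batch_size: 25000
    # Upper bound on batch size; oversized batches are split before export.
    send_batch_max_size: 50000
    # Maximum time to wait before flushing a batch that has not yet reached
    # send_batch_size.
    timeout: 5s
```

A larger send_batch_size means fewer, bigger INSERTs into ClickHouse, which is the range the series above recommends; the trade-off is that more in-flight data is lost if the collector dies before flushing.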
```diff
         key: gitlab.target_namespace_id
         from_context: metadata.x-target-namespaceid
     batch:
-      send_batch_size: 25000
-      send_batch_max_size: 50000
-      timeout: 5s
+      send_batch_size: 200000
+      send_batch_max_size: 250000
+      timeout: 10s
```

nitpick (non-blocking): We are bumping size/max size by a factor of x4 and x5, whereas the timeout only by x2. What is the reasoning behind these numbers? Why not go straight to 1M events/60 seconds, given we are not aiming to be real-time anyway? Is there a max batch size for ClickHouse? Is it possible that we create a batch big enough to cause a ClickHouse resource usage spike?

Edited by Pawel Rozlach
suggestion: This increases the memory usage of the collector. Rate limits are tuned to the memory usage of the collector and should be revised accordingly. See also https://ops.gitlab.net/opstrace-realm/environments/gcp/observability/-/merge_requests/141
see note below !2565 (comment 1929689932)
changed this line in version 3 of the diff
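On the memory point above: in OpenTelemetry Collector deployments the batch processor is commonly paired with the memory_limiter processor, and the two are usually tuned together because larger batches raise peak memory. A sketch with placeholder limits (whether this pipeline uses memory_limiter, and the actual limits, are assumptions here):

```yaml
# Sketch only: the limits below are placeholders, not values from this deployment.
processors:
  # memory_limiter is recommended as the first processor in the pipeline so it
  # can start refusing data before the collector runs out of memory.
  memory_limiter:
    check_interval: 1s      # how often memory usage is measured
    limit_mib: 4000         # hard memory limit for the collector process
    spike_limit_mib: 800    # expected spike between checks; soft limit = limit_mib - spike_limit_mib
  batch:
    send_batch_size: 200000
    send_batch_max_size: 250000
    timeout: 10s
```

If the batch values are raised further, the memory limits (and any rate limits derived from them) would need to be revisited in the same change.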