
Add option to batch requests to elastic

Mark Woodhall requested to merge feature/batching-es into develop

Initially I wrote a custom implementation based on core.async, but then found out that spandex supports bulk indexing with a very similar implementation, so I have made use of that instead.

I've made this optional and configurable with the following environment variables:

  ES_BULK_INDEX ;; Enables bulk indexing
  ES_BULK_FLUSH_THRESHOLD ;; The number of index requests to batch before flushing
  ES_BULK_FLUSH_INTERVAL ;; The time to wait before flushing if threshold has not been met
  ES_BULK_MAX_REQUESTS ;; Maximum number of concurrent requests to ES
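
For context, this is roughly how those settings map onto spandex's bulk support. The following is a minimal sketch rather than the code in this MR: the namespace, function names, index name, and values are illustrative, and it assumes the environment variables above are parsed into numbers elsewhere.

```clojure
(ns example.bulk-indexer
  (:require [qbits.spandex :as s]
            [clojure.core.async :as async]))

(defn start-bulk-indexer
  "Returns spandex's bulk channel map ({:input-ch ... :output-ch ...}).
   The three options correspond to ES_BULK_FLUSH_THRESHOLD,
   ES_BULK_FLUSH_INTERVAL and ES_BULK_MAX_REQUESTS."
  [client {:keys [flush-threshold flush-interval max-requests]}]
  (s/bulk-chan client
               {:flush-threshold         flush-threshold
                :flush-interval          flush-interval
                :max-concurrent-requests max-requests}))

(comment
  (def client (s/client {:hosts ["http://127.0.0.1:9200"]}))

  (def bulk (start-bulk-indexer client {:flush-threshold 100
                                        :flush-interval  5000
                                        :max-requests    3}))

  ;; Each put is an [action-map source-doc] pair; spandex accumulates them and
  ;; flushes a _bulk request once the threshold or interval is hit.
  (async/put! (:input-ch bulk)
              [{:index {:_index "items"}} {:id 1 :name "example"}])

  ;; Responses arrive on :output-ch and should be consumed (e.g. in a go-loop)
  ;; so the indexer doesn't back up.
  (async/<!! (:output-ch bulk)))
```

Since ES_BULK_INDEX makes this opt-in, leaving it unset should keep the existing one-index-request-per-message behaviour.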

Before these changes I observed the following in local testing:

I've been doing a little bit of performance testing for the ingestion today and ran a couple of scenarios.

Scenario 1:

- Build an S3 metadata bucket containing 25,000 items
- Run 1 producer that generates messages for those 25,000 keys
- Run 1 consumer that subscribes to those messages and indexes into ElasticSearch

Steps 2 and 3 of this scenario take 90 seconds; by a very unscientific extrapolation we could estimate ~60 minutes to index 1 million items. Obviously the limitation here is my machine, which is running the producer, the consumer, Kafka, ZooKeeper, ElasticSearch, and Mongo.

Scenario 2:

- Build an S3 metadata bucket containing 25,000 items
- Run 1 producer that generates messages for those 25,000 keys
- Run 3 consumers that subscribe to those messages and index into ElasticSearch

As expected there was no noticeable difference here: it took ~90 seconds. If anything it was a tiny bit slower than the first scenario, since the additional consumers are all competing for the same resources and there is extra ZooKeeper co-ordination. It will be interesting to test something similar on staging infrastructure. Should I have a quick look at improving the way we send queries to ES and see what the impact is?
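
For reference, the estimates here and in the follow-up below are just linear scaling of the measured runs. A throwaway sketch of the arithmetic, with a purely hypothetical helper:

```clojure
;; Linear extrapolation behind the rough estimates (hypothetical helper,
;; not part of the codebase).
(defn extrapolate-seconds
  [measured-seconds measured-items target-items]
  (* measured-seconds (/ target-items measured-items)))

(extrapolate-seconds 90 25000 1000000)    ;; => 3600   (~60 minutes)
(extrapolate-seconds 90 25000 100000000)  ;; => 360000 (~100 hours)
```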

I've repeated the above with a few changes:

- Optimised the producer by reusing the same underlying Kafka producer for the duration of the task (sketched below)
- Running 9 consumers
- Using a real, well-specced ElasticSearch cluster:
  - Availability zones: 3
  - Instance type (data): r5.12xlarge.elasticsearch
  - Number of nodes: 3
  - Data nodes storage type: EBS
  - EBS volume type: General Purpose (SSD)
  - EBS volume size: 50 GiB
  - Instance type (master): r5.12xlarge.elasticsearch
  - Number of master instances: 3

Keep in mind that Kafka and the producer/consumers are still running on my local machine. This produces the following results:

- It now takes 10 seconds to produce 25,000 messages to Kafka, from which we can roughly estimate ~11 hours of production time for 100,000,000 messages.
- Consuming those 25,000 messages (pulling from S3 and indexing into ES) took 50 seconds, from which we can roughly estimate ~55 hours to consume 100,000,000 messages.

This is a reasonable improvement over the ~100 hours estimate from the first test, and nothing in my testing suggests it won't continue to improve with the resources we throw at it and the number of consumers we can run. Having said that, unless anyone thinks otherwise, I will try to optimise so that we write batches of inserts to ElasticSearch instead of one insert per request. What do you think @Dominika?
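
As an aside, the producer optimisation above just means constructing a single Kafka producer for the whole task and reusing it for every message, rather than creating one per send. A minimal Clojure sketch via the Java client, with illustrative names and config values (not the actual producer code):

```clojure
(ns example.producer
  (:import (org.apache.kafka.clients.producer KafkaProducer ProducerRecord)))

;; One producer instance for the duration of the task, reused for every send.
(def ^:private producer
  (KafkaProducer.
    {"bootstrap.servers" "localhost:9092"
     "key.serializer"    "org.apache.kafka.common.serialization.StringSerializer"
     "value.serializer"  "org.apache.kafka.common.serialization.StringSerializer"}))

(defn send-key!
  "Publish a single S3 key to the given topic, reusing the shared producer."
  [topic s3-key]
  (.send producer (ProducerRecord. topic s3-key s3-key)))
```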

After these changes I observed the following in local testing:

I've run this again with a version that batches queries to ElasticSearch, and I'm seeing a 20% improvement over the 50 seconds it was taking previously, which I think is reasonable. We will need to tweak the batch size to get the most out of it, but it is configurable at runtime so that should be trivial. The only thing I've got left to do is improve the logging.

I've not added any integration tests as part of this PR but I am hopeful they will follow. They are proving quite tricky to implement reliably so I wanted to break them out of this PR.
