
Configurable elastic bulk concurrency and size

Nick Thomas requested to merge (removed):configure-bulk-params into master

Part of gitlab#12375 (closed)

The gitlab-elasticsearch-indexer project is responsible for submitting documents to the Elasticsearch server. It does so using the bulk API, and the limits are currently hardcoded: each gitlab-elasticsearch-indexer process can submit up to 10 bulk requests of 10MiB each in parallel, which requires up to 100MiB of ES heap space to service.

There is a trade-off to be made between speed of indexing (many large bulks in parallel) and not overwhelming the Elasticsearch server. The right place for knowledge about these trade-offs is gitlab, rather than the indexer project, so this MR makes the two values (bulk concurrency and bulk size) configurable, retaining the previous defaults for backward compatibility.
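As a rough illustration of "configurable, with the previous defaults", here is a minimal Go sketch. It is not the code in this MR: the environment variable names `BULK_WORKERS` and `BULK_SIZE_BYTES`, the `BulkConfig` type, and the helper functions are all hypothetical, and the real indexer may take these values through flags or another mechanism. Only the defaults (10 parallel requests, 10MiB each) come from the description above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Defaults matching the previously hardcoded behaviour described above.
const (
	defaultBulkWorkers   = 10
	defaultBulkSizeBytes = 10 * 1024 * 1024 // 10MiB
)

// BulkConfig is a hypothetical container for the two tunables.
type BulkConfig struct {
	Workers   int // maximum bulk requests in flight at once
	SizeBytes int // maximum size of a single bulk request
}

// positiveIntOr parses raw as a positive integer and returns fallback
// when raw is empty, malformed, or not greater than zero.
func positiveIntOr(raw string, fallback int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return fallback
	}
	return n
}

// loadBulkConfig reads the (assumed) variable names through getenv so the
// lookup can be swapped out in tests.
func loadBulkConfig(getenv func(string) string) BulkConfig {
	return BulkConfig{
		Workers:   positiveIntOr(getenv("BULK_WORKERS"), defaultBulkWorkers),
		SizeBytes: positiveIntOr(getenv("BULK_SIZE_BYTES"), defaultBulkSizeBytes),
	}
}

// MaxInFlightBytes is the upper bound on request payload that one indexer
// process can have outstanding: workers multiplied by bulk size.
func (c BulkConfig) MaxInFlightBytes() int {
	return c.Workers * c.SizeBytes
}

func main() {
	cfg := loadBulkConfig(os.Getenv)
	fmt.Printf("workers=%d bulk=%d bytes, up to %d bytes in flight\n",
		cfg.Workers, cfg.SizeBytes, cfg.MaxInFlightBytes())
}
```

With nothing set, the sketch reproduces the old behaviour (10 workers, 10MiB bulks, 100MiB in flight). Falling back to the default on invalid input is one possible design choice; rejecting bad values with an error would be equally reasonable.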

In the gitlab codebase, we can start by simply exposing the numbers as application configuration settings. In the future, we may want to dial concurrency up or down per-project depending on a wider view of load, detect the maximum permitted bulk size automatically, or some other strategy.

Fewer, larger bulk requests may be more efficient; the 10MiB limit we currently have comes from limits imposed by AWS on their small ES services, but their larger ones accept 100MiB bulks.
