Architecture of Self-Managed Elasticsearch

For the Elasticsearch cluster supporting GitLab.com's Advanced Search feature, we will migrate from Elastic Cloud to a self-managed deployment on VMs. We may later migrate to Kubernetes, but starting on VMs lets us focus on Elasticsearch itself rather than on the additional abstraction and management overhead of running in GKE.

We are tentatively planning to run:

OS: Ubuntu 20.04 (even though most of our VM fleet is currently still on 18.04)
GCP zones: 3
Replicas: 2 (plus 1 primary), so 3 copies of each shard in total. With those copies spread across the 3 zones, we will still have redundancy even if we lose any 1 zone.
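
As a rough check of the redundancy claim, here is a minimal sketch that places 1 primary + 2 replicas across 3 zones and counts the copies left after losing any one zone. It assumes the copies end up evenly spread, one per zone, which in Elasticsearch would typically rely on shard allocation awareness (the `cluster.routing.allocation.awareness.attributes` setting keyed on a custom node attribute). The zone names and helper functions are hypothetical and only for illustration.

```python
# Hypothetical zone names; the plan only states that there are 3 GCP zones.
ZONES = ["zone-a", "zone-b", "zone-c"]
REPLICAS = 2  # plus 1 primary = 3 copies of each shard


def copies_per_zone(zones, replicas):
    """Spread 1 primary + `replicas` copies round-robin across zones.

    Assumption: allocation is zone-aware, so copies are distributed
    as evenly as possible rather than stacked in one zone.
    """
    placement = {zone: 0 for zone in zones}
    for i in range(replicas + 1):
        placement[zones[i % len(zones)]] += 1
    return placement


def surviving_copies(placement, lost_zone):
    """Count the shard copies that remain after one zone is lost."""
    return sum(n for zone, n in placement.items() if zone != lost_zone)


placement = copies_per_zone(ZONES, REPLICAS)
# Worst case over every possible single-zone outage.
worst_case = min(surviving_copies(placement, zone) for zone in ZONES)
print(placement, "worst case surviving copies:", worst_case)
```

With 2 replicas the worst case still leaves 2 copies, so a single further node failure does not lose data. Running the same check with 1 replica leaves only 1 copy in the worst case, which is the motivation for the extra replica.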

Notes on sizing per ES node:

CPU: Probably between 8 and 32 vCPUs, depending on test results.
- Needs to accommodate combined workload of: ingestion, query, shard rebalancing, and GC.
- Implicitly affects GC concurrency and pause duration.
- Implicitly affects network egress throughput limit (which affects workloads like shard rebalancing).
- Might implicitly affect default ES thread pool sizing for concurrency-safe tasks (e.g. queries that access multiple shards).
Memory: Up to 30 GB JVM heap, plus a generous amount for filesystem cache
Storage: Calculate this from current dataset size, allowing for:
- 3 copies of each shard (1 primary + 2 replicas) instead of 2
- difference in node count
- enough padding for at least 6 months of long-term data growth (can grow filesystem online)
- enough padding for short-term spikes due to: segment merging, abrupt allocation of shards due to peer node failure, potentially unequal distribution of shard growth, etc.
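
The storage factors above can be combined into a back-of-the-envelope estimate. The following is only a sketch: the function name, the compounding-growth model, and every input value are assumptions for illustration, not measurements from the current cluster.

```python
import math


def per_node_storage_gb(primary_gb, copies, node_count,
                        monthly_growth, months, spike_headroom):
    """Estimate the disk needed per node, in GB (rounded up).

    primary_gb:     current size of primary shards only (hypothetical input)
    copies:         total copies of each shard (1 primary + replicas)
    node_count:     number of data nodes in the new cluster
    monthly_growth: assumed fractional growth per month (0.05 = 5%)
    months:         planning horizon for long-term growth
    spike_headroom: extra fraction reserved for short-term spikes
                    (segment merges, shard reallocation after a peer
                    node failure, uneven shard growth)
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    total_now = primary_gb * copies
    # Assumption: growth compounds monthly over the planning horizon.
    total_future = total_now * (1 + monthly_growth) ** months
    per_node = total_future / node_count
    return math.ceil(per_node * (1 + spike_headroom))


# Placeholder numbers only: 1 TB of primaries, 3 copies, 6 nodes,
# 5% monthly growth for 6 months, 50% headroom for spikes.
estimate = per_node_storage_gb(
    primary_gb=1000, copies=3, node_count=6,
    monthly_growth=0.05, months=6, spike_headroom=0.5,
)
print(f"Provision at least {estimate} GB per node")
```

Because the filesystem can be grown online, the growth horizon can be kept fairly short, but the spike headroom has to be available up front since those events happen faster than a resize can be planned.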

Edited Jan 15, 2021 by Matt Smiley