benchmarking: stress test a small cluster to understand failure modes
The purpose of this effort is to understand how an ES cluster fails when one of its shards grows too big for the cluster to handle.
cluster spec
5 VMs in total
region (VMs spread across all 3 zones):
- us-central-1
2 hot nodes:
- RAM: 1GB
- disk: 30GB
- names: instance-0, instance-2
2 warm nodes:
- RAM: 2GB
- disk: 300GB
- names: instance-1, instance-4
master node (master-eligible, but not the elected master at the time; see the config sketch below):
- RAM: 1GB
- disk: 2GB
- name: instance-3
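the hot/warm split is presumably implemented via node attributes plus role settings in elasticsearch.yml. A minimal sketch, assuming the conventional box_type attribute and the legacy-style role settings of this ES era (the actual attribute name and layout in the chef config may differ):

# instance-0, instance-2 (hypothetical elasticsearch.yml)
node.attr.box_type: hot

# instance-1, instance-4
node.attr.box_type: warm

# instance-3: master-eligible, holds no data
node.master: true
node.data: false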
index setup
an alias and an index template, with a single index created from the template (1 primary shard + 1 replica) and the following ILM policy configured on the index:
{
  "indices": {
    "pubsub-nginx-inf-gprd-000001": {
      "index": "pubsub-nginx-inf-gprd-000001",
      "managed": true,
      "policy": "gitlab-infra-ilm-policy",
      "lifecycle_date_millis": 1564656405073,
      "phase": "hot",
      "phase_time_millis": 1564656405960,
      "action": "rollover",
      "action_time_millis": 1564656585628,
      "step": "check-rollover-ready",
      "step_time_millis": 1564656585628,
      "phase_execution": {
        "policy": "gitlab-infra-ilm-policy",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_size": "200gb",
              "max_age": "10d"
            },
            "set_priority": {
              "priority": 50
            }
          }
        },
        "version": 4,
        "modified_date_in_millis": 1564656125625
      }
    }
  }
}
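for reference, a minimal sketch of how a template/alias/index combination like this could be bootstrapped; the index and policy names come from the output above, but the template body itself is an assumption rather than the actual stg config (the box_type allocation filter matches the attribute assumed in the cluster spec sketch):

PUT _template/pubsub-nginx-inf-gprd
{
  "index_patterns": ["pubsub-nginx-inf-gprd-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.lifecycle.name": "gitlab-infra-ilm-policy",
    "index.lifecycle.rollover_alias": "pubsub-nginx-inf-gprd",
    "index.routing.allocation.require.box_type": "hot"
  }
}

PUT pubsub-nginx-inf-gprd-000001
{
  "aliases": {
    "pubsub-nginx-inf-gprd": { "is_write_index": true }
  }
}

ILM then rolls the write alias over to a new index once max_size or max_age from the policy is hit.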
shards
2 shards, both on hot nodes:
- primary: instance-2
- replica: instance-0
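shard placement can be verified with the cat shards API; a minimal check (index pattern taken from this test):

GET _cat/shards/pubsub-nginx-inf-gprd-*?v

# expect one line per shard copy; the prirep column shows p/r and the
# node column should list instance-2 (primary) and instance-0 (replica)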
All other pubsubbeats in stg were stopped; this was the only index growing at the time.
ES monitoring metrics
monitoring metrics sent to a separate cluster
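a minimal sketch of how shipping monitoring metrics to a separate cluster can be wired up, assuming the stack's built-in HTTP exporter (the exporter name and host below are placeholders, not the actual setup):

# elasticsearch.yml (hypothetical)
xpack.monitoring.collection.enabled: true
xpack.monitoring.exporters:
  remote_monitoring:
    type: http
    host: ["https://monitoring-cluster.example:9200"]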
data source
nginx logs from gprd: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1553
results
the time axis on some of the screenshots below is in CEST (UTC+2)
failures started happening around 13:50 CEST (11:50 UTC)
the cluster does not fail completely; it just processes new documents at a lower rate:
a number of stats from instance-2 (hot node with the primary shard):
stats from instance-0 (hot node with the replica):
index size at the time:
conclusions
good indicators of the cluster being overloaded (see the stats API sketch after these lists):
- indexing and search latency in index stats
- relative indexing and request rates
- cgroup cpu utilisation
- number of indexing threads
- indexing time
bad indicators:
- jvm heap
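all of the good indicators above are available from the standard stats APIs; a minimal sketch of the relevant calls (index name taken from this test; on ES 6.3+ the indexing thread pool is called write):

# indexing/search time and operation counts; latency can be derived
# from the deltas between two samples (time spent / number of ops)
GET pubsub-nginx-inf-gprd-000001/_stats/indexing,search

# per-node indexing threads, queue sizes and rejections
GET _nodes/stats/thread_pool?filter_path=nodes.*.name,nodes.*.thread_pool.write

# cgroup cpu utilisation (reported when ES runs inside a cgroup)
GET _nodes/stats/os?filter_path=nodes.*.name,nodes.*.os.cgroup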
estimated optimal shard size: 90% * 2.3GB (the index size when failures started) ≈ 2GB, i.e. the size at which the hot nodes started to struggle, minus a ~10% safety margin