
benchmarking: stress test a small cluster to understand failure modes

The purpose of this effort is to understand how an ES cluster fails when it has a shard that is too big for it to handle.

cluster spec

5 VMs in total (a sketch of the hot/warm allocation settings follows the spec)

region (VMs spread across all 3 zones):

  • us-central-1

2 hot nodes:

  • RAM: 1GB
  • disk: 30GB
  • names: instance-0, instance-2

2 warm nodes:

  • RAM: 2GB
  • disk: 300GB
  • names: instance-1, instance-4

master-eligible node (not currently acting as master):

  • RAM: 1GB
  • disk: 2GB
  • name: instance-3
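
For context, a hot/warm layout like this is typically expressed with a custom node attribute in each node's elasticsearch.yml (for example node.attr.data: hot on instance-0/instance-2 and node.attr.data: warm on instance-1/instance-4) plus an allocation filter on the index. The attribute name and the request below are assumptions for illustration, not taken from the actual cluster config:

# keep the shards of the benchmark index on the hot nodes (attribute name "data" is assumed)
PUT pubsub-nginx-inf-gprd-000001/_settings
{
    "index.routing.allocation.require.data": "hot"
}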

index setup

alias defined via an index template, a single index created from that template with 1 primary shard + 1 replica, and the following ILM policy configured on the index (shown as _ilm/explain output; a sketch of the policy and template follows the JSON):

{
    "indices": {
        "pubsub-nginx-inf-gprd-000001": {
            "index": "pubsub-nginx-inf-gprd-000001",
            "managed": true,
            "policy": "gitlab-infra-ilm-policy",
            "lifecycle_date_millis": 1564656405073,
            "phase": "hot",
            "phase_time_millis": 1564656405960,
            "action": "rollover",
            "action_time_millis": 1564656585628,
            "step": "check-rollover-ready",
            "step_time_millis": 1564656585628,
            "phase_execution": {
                "policy": "gitlab-infra-ilm-policy",
                "phase_definition": {
                    "min_age": "0ms",
                    "actions": {
                        "rollover": {
                            "max_size": "200gb",
                            "max_age": "10d"
                        },
                        "set_priority": {
                            "priority": 50
                        }
                    }
                },
                "version": 4,
                "modified_date_in_millis": 1564656125625
            }
        }
    }
}
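
The JSON above is the state reported by ILM; for reference, a policy definition and template that would produce this setup look roughly like the sketch below. The rollover and priority values match the phase_execution block above, but the template name, index pattern, and rollover alias are assumptions derived from the index name:

PUT _ilm/policy/gitlab-infra-ilm-policy
{
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    "rollover": {
                        "max_size": "200gb",
                        "max_age": "10d"
                    },
                    "set_priority": {
                        "priority": 50
                    }
                }
            }
        }
    }
}

# template that wires new indices to the policy and the write alias (names assumed)
PUT _template/pubsub-nginx-inf-gprd
{
    "index_patterns": ["pubsub-nginx-inf-gprd-*"],
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
        "index.lifecycle.name": "gitlab-infra-ilm-policy",
        "index.lifecycle.rollover_alias": "pubsub-nginx-inf-gprd"
    }
}

# bootstrap the first index behind the write alias (alias name assumed)
PUT pubsub-nginx-inf-gprd-000001
{
    "aliases": {
        "pubsub-nginx-inf-gprd": {
            "is_write_index": true
        }
    }
}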

shards

2 shards, both on hot nodes (a placement check is sketched below):

  • primary: instance-2
  • replica: instance-0
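
For reference, the placement can be confirmed with the cat shards API (request shown for illustration):

GET _cat/shards/pubsub-nginx-inf-gprd-000001?v&h=index,shard,prirep,state,store,node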

All other pubsubbeats in stg were stopped, so this was the only index that was growing at the time.

ES monitoring metrics

monitoring metrics were sent to a separate cluster
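
How the shipping was configured is not shown here; one common approach (an assumption, not verified against this cluster) is a dedicated HTTP exporter, with the monitoring cluster's URL as a placeholder. This assumes the exporter settings are accepted dynamically in this version; otherwise the same keys go into elasticsearch.yml:

PUT _cluster/settings
{
    "persistent": {
        "xpack.monitoring.collection.enabled": true,
        "xpack.monitoring.exporters.remote_monitoring.type": "http",
        "xpack.monitoring.exporters.remote_monitoring.host": ["https://monitoring-cluster.example:9200"]
    }
}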

data source

nginx logs from gprd: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1553

results

the time axis on some of the screenshots below is in CEST (UTC+2)

failures started happening around 13:50 CEST (11:50 UTC)

[screenshots: Screenshot_2019-08-01_at_15.25.47, Screenshot_2019-08-01_at_15.30.48]

the cluster is not failing completely; it just processes new documents at a lower rate:

[screenshot: Screenshot_2019-08-01_at_15.34.36]
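
The lower ingest rate can also be seen from the cluster itself: when the hot node cannot keep up, the write thread pool queue grows and rejections start to appear. A quick way to check (assuming a 7.x cluster, where the pool is named write):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed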

a number of stats from instance-2 (hot node with the primary shard):

[screenshots: Screenshot_2019-08-01_at_15.37.24, Screenshot_2019-08-01_at_15.40.36, Screenshot_2019-08-01_at_15.41.11, Screenshot_2019-08-01_at_15.41.29, Screenshot_2019-08-01_at_15.41.56, Screenshot_2019-08-01_at_15.45.48, Screenshot_2019-08-01_at_15.48.02]
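
Outside the monitoring UI, the same per-node figures (indexing time, thread pool activity, cgroup CPU, heap) can also be pulled from the nodes stats API; the node-name filter below is illustrative:

GET _nodes/instance-2/stats/indices,jvm,os,thread_pool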

stats from instance-0 (hot node with the replica):

[screenshots: Screenshot_2019-08-01_at_15.52.07, Screenshot_2019-08-01_at_15.52.21, Screenshot_2019-08-01_at_15.52.36, Screenshot_2019-08-01_at_15.52.50, Screenshot_2019-08-01_at_15.53.06, Screenshot_2019-08-01_at_15.53.25, Screenshot_2019-08-01_at_15.53.49]

Index size at the time:

[screenshot: Screenshot_2019-08-01_at_16.05.35]
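
The same numbers can be read off the cat indices API (columns chosen for illustration):

GET _cat/indices/pubsub-nginx-inf-gprd-*?v&h=index,pri,rep,docs.count,pri.store.size,store.size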

conclusions

good indicators of the cluster being overloaded:

  • indexing and search latency in index stats (an example request is included at the end of this section)
  • relative indexing and request rates
  • cgroup cpu utilisation
  • number of indexing threads
  • indexing time

bad indicators:

  • jvm heap

estimated optimal shard size: 90% × 2.3GB ≈ 2GB
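
The latency indicators above come from index stats; outside the monitoring cluster they can be sampled directly, with average latency derived as the change in the time-in-millis counters divided by the change in the corresponding operation counts between two samples. A minimal example request:

GET pubsub-nginx-inf-gprd-000001/_stats/indexing,search

The relevant counters in the response are indexing.index_total / indexing.index_time_in_millis and search.query_total / search.query_time_in_millis.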
