
benchmarking: stress test a small cluster to understand failure modes

The purpose of this effort is to understand how an ES cluster fails when it has a shard that is too big for it to handle.

cluster spec

5 VMs in total (a sketch of the hot/warm allocation settings follows the spec)

region (VMs spread across all 3 zones):

  • us-central-1

2 hot nodes:

  • RAM: 1GB
  • disk: 30GB
  • names: instance-0, instance-2

2 warm nodes:

  • RAM: 2GB
  • disk: 300GB
  • names: instance-1, instance-4

master-eligible node (not currently acting as master):

  • RAM: 1GB
  • disk: 2GB
  • name: instance-3
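
For context, a hot/warm layout like this is typically expressed with a custom node attribute in each node's elasticsearch.yml (for example node.attr.data: hot on instance-0/instance-2 and node.attr.data: warm on instance-1/instance-4) plus an allocation filter on the index. The attribute name and the request below are assumptions for illustration, not taken from the actual cluster config:

# keep the shards of the benchmark index on the hot nodes (attribute name "data" is assumed)
PUT pubsub-nginx-inf-gprd-000001/_settings
{
    "index.routing.allocation.require.data": "hot"
}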

index setup

alias defined via an index template, a single index created from that template with 1 primary shard + 1 replica, and the following ILM policy configured on the index (shown as _ilm/explain output; a sketch of the policy and template follows the JSON):

{
    "indices": {
        "pubsub-nginx-inf-gprd-000001": {
            "index": "pubsub-nginx-inf-gprd-000001",
            "managed": true,
            "policy": "gitlab-infra-ilm-policy",
            "lifecycle_date_millis": 1564656405073,
            "phase": "hot",
            "phase_time_millis": 1564656405960,
            "action": "rollover",
            "action_time_millis": 1564656585628,
            "step": "check-rollover-ready",
            "step_time_millis": 1564656585628,
            "phase_execution": {
                "policy": "gitlab-infra-ilm-policy",
                "phase_definition": {
                    "min_age": "0ms",
                    "actions": {
                        "rollover": {
                            "max_size": "200gb",
                            "max_age": "10d"
                        },
                        "set_priority": {
                            "priority": 50
                        }
                    }
                },
                "version": 4,
                "modified_date_in_millis": 1564656125625
            }
        }
    }
}
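
The JSON above is the state reported by ILM; for reference, a policy definition and template that would produce this setup look roughly like the sketch below. The rollover and priority values match the phase_execution block above, but the template name, index pattern, and rollover alias are assumptions derived from the index name:

PUT _ilm/policy/gitlab-infra-ilm-policy
{
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    "rollover": {
                        "max_size": "200gb",
                        "max_age": "10d"
                    },
                    "set_priority": {
                        "priority": 50
                    }
                }
            }
        }
    }
}

# template that wires new indices to the policy and the write alias (names assumed)
PUT _template/pubsub-nginx-inf-gprd
{
    "index_patterns": ["pubsub-nginx-inf-gprd-*"],
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1,
        "index.lifecycle.name": "gitlab-infra-ilm-policy",
        "index.lifecycle.rollover_alias": "pubsub-nginx-inf-gprd"
    }
}

# bootstrap the first index behind the write alias (alias name assumed)
PUT pubsub-nginx-inf-gprd-000001
{
    "aliases": {
        "pubsub-nginx-inf-gprd": {
            "is_write_index": true
        }
    }
}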

shards

2 shards, both on hot nodes (a placement check is sketched below):

  • primary: instance-2
  • replica: instance-0
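
For reference, the placement can be confirmed with the cat shards API (request shown for illustration):

GET _cat/shards/pubsub-nginx-inf-gprd-000001?v&h=index,shard,prirep,state,store,node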

All other pubsubbeats in stg were stopped, so this was the only index that was growing at the time.

ES monitoring metrics

monitoring metrics were sent to a separate cluster
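
How the shipping was configured is not shown here; one common approach (an assumption, not verified against this cluster) is a dedicated HTTP exporter, with the monitoring cluster's URL as a placeholder. This assumes the exporter settings are accepted dynamically in this version; otherwise the same keys go into elasticsearch.yml:

PUT _cluster/settings
{
    "persistent": {
        "xpack.monitoring.collection.enabled": true,
        "xpack.monitoring.exporters.remote_monitoring.type": "http",
        "xpack.monitoring.exporters.remote_monitoring.host": ["https://monitoring-cluster.example:9200"]
    }
}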

data source

nginx logs from gprd: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1553

results

the time axis on some of the screenshots below is in CEST (UTC+2)

failures started happening around 13:50 CEST (11:50 UTC)

[screenshots: Screenshot_2019-08-01_at_15.25.47, Screenshot_2019-08-01_at_15.30.48]

the cluster is not failing completely; it just processes new documents at a lower rate:

[screenshot: Screenshot_2019-08-01_at_15.34.36]
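
The lower ingest rate can also be seen from the cluster itself: when the hot node cannot keep up, the write thread pool queue grows and rejections start to appear. A quick way to check (assuming a 7.x cluster, where the pool is named write):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed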

a number of stats from instance-2 (hot node with the primary shard):

[screenshots: Screenshot_2019-08-01_at_15.37.24, Screenshot_2019-08-01_at_15.40.36, Screenshot_2019-08-01_at_15.41.11, Screenshot_2019-08-01_at_15.41.29, Screenshot_2019-08-01_at_15.41.56, Screenshot_2019-08-01_at_15.45.48, Screenshot_2019-08-01_at_15.48.02]
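
Outside the monitoring UI, the same per-node figures (indexing time, thread pool activity, cgroup CPU, heap) can also be pulled from the nodes stats API; the node-name filter below is illustrative:

GET _nodes/instance-2/stats/indices,jvm,os,thread_pool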

stats from instance-0 (hot node with the replica):

[screenshots: Screenshot_2019-08-01_at_15.52.07, Screenshot_2019-08-01_at_15.52.21, Screenshot_2019-08-01_at_15.52.36, Screenshot_2019-08-01_at_15.52.50, Screenshot_2019-08-01_at_15.53.06, Screenshot_2019-08-01_at_15.53.25, Screenshot_2019-08-01_at_15.53.49]

Index size at the time:

[screenshot: Screenshot_2019-08-01_at_16.05.35]
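
The same numbers can be read off the cat indices API (columns chosen for illustration):

GET _cat/indices/pubsub-nginx-inf-gprd-*?v&h=index,pri,rep,docs.count,pri.store.size,store.size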

conclusions

good indicators of the cluster being overloaded:

  • indexing and search latency in index stats (an example request is included at the end of this section)
  • relative indexing and request rates
  • cgroup cpu utilisation
  • number of indexing threads
  • indexing time

bad indicators:

  • jvm heap

estimated optimal shard size: 90% × 2.3GB ≈ 2GB
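
The latency indicators above come from index stats; outside the monitoring cluster they can be sampled directly, with average latency derived as the change in the time-in-millis counters divided by the change in the corresponding operation counts between two samples. A minimal example request:

GET pubsub-nginx-inf-gprd-000001/_stats/indexing,search

The relevant counters in the response are indexing.index_total / indexing.index_time_in_millis and search.query_total / search.query_time_in_millis.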
