Redis Cluster: Load experiments for resharding under load
The biggest benefit of Redis Cluster is the ability to scale horizontally by adding extra shards, each of which takes up 1/n of the keyslots (where n is the number of shards).
As @reprazent pointed out earlier, we need to understand the impact of migrating keyslots on CPU utilisation during scale-up (each of the existing n-1 shards migrates 1/(n-1) - 1/n of the keyslots to the new shard) and scale-down (the exiting shard migrates its 1/n of the keyslots to the remaining n-1 shards). This is important: if we underestimate the impact of migrating keyslots, we could end up scaling up too late.
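For concreteness (Redis Cluster always has 16384 hash slots): scaling from 3 to 4 shards means each of the 3 existing shards hands off 16384 × (1/3 - 1/4) ≈ 1365 slots, i.e. 4096 slots in total land on the new shard; scaling from 4 back down to 3, the exiting shard's 4096 slots are spread across the 3 survivors.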
This issue's goal is to understand how Redis behaves when resharding/rebalancing under production-like load, to surface any operational concerns.
Refer to #1945 (closed)
Setup (see #1346 (closed))
- Using https://gitlab.com/gitlab-com/gl-infra/redis-load-test to generate load against each VM (k6 runs on a separate c2-standard-8 (8 vCPU, 32 GB RAM))
- Using omnibus-gitlab with cookbook attributes from https://gitlab.com/gitlab-org/omnibus-gitlab/-/blob/schin1-redis-load-test/files/gitlab-cookbooks/redis/attributes/attributes.rb on each VM (c2-standard-4 (4 vCPU, 16 GB RAM))
k6 command

```shell
# SKIP_DATA_POPULATION added for subsequent runs
./_build/k6 run ./script.js --env REDIS_CLUSTER_NODE=<VM public IP>:6379 --env TRACE_PATH=./traces/cache-trace-2021-11-05.json --env TRACE_DURATION=15s --env SKIP_DATA_POPULATION=1 --env DURATION=20m
```
One caveat: the dataset populated is small compared to redis-cache in production, so the amount of data (keys) migrated during resharding may not be representative. We might need a more recent trace.
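To quantify that gap after each population run, we can check keyspace size and memory footprint on a test shard and compare against production. A minimal sketch (the host placeholder follows the convention above):

```shell
# Number of keys on this shard
redis-cli -h <VM public IP> DBSIZE
# Memory footprint of the populated dataset
redis-cli -h <VM public IP> INFO memory | grep -E 'used_memory_human|used_memory_peak_human'
```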
Node topology
- 3 master nodes (no replicas) + 3 load generators (~15% CPU utilisation on each shard) - to capture flamegraphs and identify areas to focus on (see the sketch after this list)
- 9 nodes (3 masters + 6 replicas) + 8(?) load generators
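For the flamegraphs, one option is to sample a shard with perf while k6 traffic is running and render the result with Brendan Gregg's FlameGraph scripts. A sketch, assuming the FlameGraph repo is checked out on the VM; the 99 Hz rate and 60 s window are arbitrary choices:

```shell
# Sample on-CPU stacks of the redis-server process for 60s at 99Hz
sudo perf record -F 99 -g -p "$(pgrep -x redis-server)" -- sleep 60
# Fold the stacks and render an interactive SVG flamegraph
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > redis-flamegraph.svg
```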
Refer to https://gitlab.com/gitlab-com/gl-infra/redis-cluster-sandbox for a quick way to bring up a 9-node Redis Cluster on VMs.
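Once the cluster is up and k6 is replaying the trace, the resharding itself could be driven with redis-cli's cluster subcommands. A sketch of one scale-up/scale-down cycle (IPs and node IDs are placeholders; assumes redis-cli 5.0+ with `--cluster` support):

```shell
# Scale up: join an empty master, then move an even share of slots onto it
redis-cli --cluster add-node <new VM IP>:6379 <existing VM IP>:6379
redis-cli --cluster rebalance <existing VM IP>:6379 --cluster-use-empty-masters

# Scale down: drain the exiting shard back to the others, then remove it
redis-cli --cluster rebalance <existing VM IP>:6379 --cluster-weight <exiting node ID>=0
redis-cli --cluster del-node <existing VM IP>:6379 <exiting node ID>
```

Running these while the load generators are active should surface the CPU impact of slot migration that this issue is after.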