Design Redis Cluster scaling methodology
This issue discusses the approach to scaling gitlab.com's Redis Clusters. The outcome should be a documented process for doing so, including runbook scripts and any changes needed to the Redis Cluster cookbooks.
## General overview
- Provision extra nodes
- Add nodes to existing Redis Cluster
- Reshard/rebalance slots from the existing nodes to the new nodes
## Considerations
- How the new nodes become discoverable to the application
- The time frame/horizon over which the resharding is performed: do we run the resharding continuously across time zones, or only during low-traffic periods?
- Do we use the `redis-cli --cluster reshard` command, or a more involved/manual approach using `CLUSTER SETSLOT` and `MIGRATE`? (A sketch of the manual flow follows this list.)
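For the latter option, here is a minimal sketch of moving a single slot by hand; the hosts, port, and slot number are placeholders, and this is essentially what `redis-cli --cluster reshard` automates for each slot it moves:

```shell
SLOT=1000
SRC_HOST=10.0.0.1; DST_HOST=10.0.0.2; PORT=6379
SRC_ID=$(redis-cli -h "$SRC_HOST" -p "$PORT" CLUSTER MYID)
DST_ID=$(redis-cli -h "$DST_HOST" -p "$PORT" CLUSTER MYID)

# 1. Mark the slot as importing on the destination, migrating on the source.
redis-cli -h "$DST_HOST" -p "$PORT" CLUSTER SETSLOT "$SLOT" IMPORTING "$SRC_ID"
redis-cli -h "$SRC_HOST" -p "$PORT" CLUSTER SETSLOT "$SLOT" MIGRATING "$DST_ID"

# 2. Move the keys in batches until the slot is empty
#    (assumes key names contain no whitespace).
while true; do
  KEYS=$(redis-cli -h "$SRC_HOST" -p "$PORT" CLUSTER GETKEYSINSLOT "$SLOT" 100)
  [ -z "$KEYS" ] && break
  redis-cli -h "$SRC_HOST" -p "$PORT" MIGRATE "$DST_HOST" "$PORT" "" 0 5000 KEYS $KEYS
done

# 3. Assign the slot to the destination; run against both nodes (ideally
#    all masters) so the topology converges without waiting for gossip.
redis-cli -h "$SRC_HOST" -p "$PORT" CLUSTER SETSLOT "$SLOT" NODE "$DST_ID"
redis-cli -h "$DST_HOST" -p "$PORT" CLUSTER SETSLOT "$SLOT" NODE "$DST_ID"
```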
## Proposed approach
I'm writing out a general approach here to get the ball rolling.
### Provisioning
- Update chef-repo with new roles (e.g. https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/master/roles/gprd-base-db-redis-cluster-cache-shard-01.json)
- Provision nodes using config-mgmt (https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/blob/master/environments/gstg/redis-cluster.tf#L56) by increasing the number of nodes.
### Update topology
- Join the first new node to the existing Redis Cluster using `CLUSTER MEET <existing node IP> <existing node port>`
- Add the remaining new nodes to the cluster and assign them as replicas of the first (see the sketch after this list)
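A sketch of those two steps with redis-cli, using placeholder addresses; note that `CLUSTER MEET` takes the IP and port of a node already in the cluster:

```shell
# Run on the first new node: introduce it to any existing cluster member.
redis-cli -h <new-node-1> -p 6379 CLUSTER MEET <existing-node-ip> 6379

# Capture the first new node's ID so the others can replicate it.
redis-cli -h <new-node-1> -p 6379 CLUSTER MYID

# For each remaining new node: join the cluster, then attach as a replica.
redis-cli -h <new-node-N> -p 6379 CLUSTER MEET <existing-node-ip> 6379
redis-cli -h <new-node-N> -p 6379 CLUSTER REPLICATE <new-node-1-id>
```

Alternatively, `redis-cli --cluster add-node <new-node>:6379 <existing-node>:6379 --cluster-slave --cluster-master-id <new-node-1-id>` performs the join and replica assignment in one step.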
### Reshard nodes
- Run `redis-cli --cluster reshard localhost:6379` and provide the relevant inputs as prompted; an example session against a local test cluster is shown below, with verification commands after it.
```
➜  ~ redis-cli --cluster reshard 127.0.0.1:6379
>>> Performing Cluster Check (using node 127.0.0.1:6379)
S: b45555f7f5d029e831f35086d86393d6e988c0ec 127.0.0.1:6379
   slots: (0 slots) slave
   replicates afaf2f3cac35fed698b80b0678d8786ba0a18fe8
M: afaf2f3cac35fed698b80b0678d8786ba0a18fe8 127.0.0.1:6380
   slots: (0 slots) master
   1 additional replica(s)
S: a21a523fdb4b579a28e7010b453cd56c23c0dfec 127.0.0.1:7203
   slots: (0 slots) slave
   replicates 88b0070c0c423a704ea9fa40d0b5d8fe7cc4bbf5
S: b89f3fb6a658584ca2311f7d9c94fa32de1d0f9a 127.0.0.1:7102
   slots: (0 slots) slave
   replicates 8da35ed9b8d8a84940c40fdab232c1a814143916
S: b7dfe199f5589ae923a106de60f6b9b4a22a82d7 127.0.0.1:7202
   slots: (0 slots) slave
   replicates 88b0070c0c423a704ea9fa40d0b5d8fe7cc4bbf5
S: f48e34ffbfd4520e9dff1b9aa1b61938999900bb 127.0.0.1:7003
   slots: (0 slots) slave
   replicates 32d7f14ee59384186223bc20e9d67fe619fc9a21
M: 8da35ed9b8d8a84940c40fdab232c1a814143916 127.0.0.1:7101
   slots:[5462-10922] (5461 slots) master
   2 additional replica(s)
S: 54f91518994d7a6c79961622b5c9a4db70e7646e 127.0.0.1:7002
   slots: (0 slots) slave
   replicates 32d7f14ee59384186223bc20e9d67fe619fc9a21
M: 32d7f14ee59384186223bc20e9d67fe619fc9a21 127.0.0.1:7001
   slots:[10923-16383] (5461 slots) master
   2 additional replica(s)
M: 88b0070c0c423a704ea9fa40d0b5d8fe7cc4bbf5 127.0.0.1:7201
   slots:[0-5461] (5462 slots) master
   2 additional replica(s)
S: de51b40c0d8fc9b4ffbd2ce5dd3bce3e152b3f3d 127.0.0.1:7103
   slots: (0 slots) slave
   replicates 8da35ed9b8d8a84940c40fdab232c1a814143916
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
How many slots do you want to move (from 1 to 16384)? 10
What is the receiving node ID? afaf2f3cac35fed698b80b0678d8786ba0a18fe8
Please enter all the source node IDs.
  Type 'all' to use all the nodes as source nodes for the hash slots.
  Type 'done' once you entered all the source nodes IDs.
Source node #1: all

Ready to move 10 slots.
  Source nodes:
    M: 8da35ed9b8d8a84940c40fdab232c1a814143916 127.0.0.1:7101
       slots:[5462-10922] (5461 slots) master
       2 additional replica(s)
    M: 32d7f14ee59384186223bc20e9d67fe619fc9a21 127.0.0.1:7001
       slots:[10923-16383] (5461 slots) master
       2 additional replica(s)
    M: 88b0070c0c423a704ea9fa40d0b5d8fe7cc4bbf5 127.0.0.1:7201
       slots:[0-5461] (5462 slots) master
       2 additional replica(s)
  Destination node:
    M: afaf2f3cac35fed698b80b0678d8786ba0a18fe8 127.0.0.1:6380
       slots: (0 slots) master
       1 additional replica(s)
  Resharding plan:
    Moving slot 0 from 88b0070c0c423a704ea9fa40d0b5d8fe7cc4bbf5
    Moving slot 1 from 88b0070c0c423a704ea9fa40d0b5d8fe7cc4bbf5
    Moving slot 2 from 88b0070c0c423a704ea9fa40d0b5d8fe7cc4bbf5
    Moving slot 3 from 88b0070c0c423a704ea9fa40d0b5d8fe7cc4bbf5
    Moving slot 5462 from 8da35ed9b8d8a84940c40fdab232c1a814143916
    Moving slot 5463 from 8da35ed9b8d8a84940c40fdab232c1a814143916
    Moving slot 5464 from 8da35ed9b8d8a84940c40fdab232c1a814143916
    Moving slot 10923 from 32d7f14ee59384186223bc20e9d67fe619fc9a21
    Moving slot 10924 from 32d7f14ee59384186223bc20e9d67fe619fc9a21
    Moving slot 10925 from 32d7f14ee59384186223bc20e9d67fe619fc9a21
Do you want to proceed with the proposed reshard plan (yes/no)? yes
Moving slot 0 from 127.0.0.1:7201 to 127.0.0.1:6380:
Moving slot 1 from 127.0.0.1:7201 to 127.0.0.1:6380:
Moving slot 2 from 127.0.0.1:7201 to 127.0.0.1:6380:
Moving slot 3 from 127.0.0.1:7201 to 127.0.0.1:6380:
Moving slot 5462 from 127.0.0.1:7101 to 127.0.0.1:6380:
Moving slot 5463 from 127.0.0.1:7101 to 127.0.0.1:6380:
Moving slot 5464 from 127.0.0.1:7101 to 127.0.0.1:6380:
Moving slot 10923 from 127.0.0.1:7001 to 127.0.0.1:6380:
Moving slot 10924 from 127.0.0.1:7001 to 127.0.0.1:6380:
Moving slot 10925 from 127.0.0.1:7001 to 127.0.0.1:6380:
```
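Afterwards (and between batches), the cluster state can be verified; `redis-cli --cluster rebalance` is also a non-interactive alternative to the prompted reshard that can push slots onto new, empty masters. Addresses below are placeholders:

```shell
# Confirm all 16384 slots are covered and nodes agree on the layout.
redis-cli --cluster check 127.0.0.1:6379

# Non-interactive alternative: rebalance slots across masters, allowing
# empty (new) masters to receive slots.
redis-cli --cluster rebalance 127.0.0.1:6379 --cluster-use-empty-masters
```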