Discussion: when should we scale up our Redis Clusters?
We have 4 Redis Clusters in production (redis-cluster-cache, redis-cluster-ratelimiting, redis-cluster-chat-cache, redis-cluster-feature-flag) and 1 more on the horizon (redis-cluster-persistent).
Epics &944 and &941 are now detailed. Both aim to make the GitLab Rails application operate robustly during online resharding, which is an important step in scaling up a Redis Cluster.
In the early stages of exploring Redis Cluster's viability, I ran a few load-test/benchmark experiments to understand how a node's behavior changes during online resharding (#1965 (closed)). TL;DR: online resharding requires the source node to perform extra work to migrate keys to a target node. I'm not sure whether that extra work could push an almost-saturated cluster past saturation during the resharding window (which would make resharding a high-risk process), but it is something to consider. We may need to revisit the benchmarks.
Looking at the current CPU saturation ratios, redis-cluster-ratelimiting comes in highest with a ~25% saturation ratio.
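To make the headroom concern concrete, here is a minimal sketch of the arithmetic. It assumes the resharding cost simply adds to the source node's current saturation, which is a simplification. The overhead and ceiling values are placeholders I made up for illustration, not measurements; real numbers would have to come from re-running the benchmarks. Only the ~25% figure is taken from this issue.

```python
def projected_saturation(current: float, reshard_overhead: float) -> float:
    """Saturation of a source node while it migrates keys.

    Assumes an additive model: migration work stacks on top of normal load.
    """
    return current + reshard_overhead


def max_safe_saturation(reshard_overhead: float, ceiling: float = 1.0) -> float:
    """Highest pre-reshard saturation that keeps the projection at or below the ceiling."""
    return ceiling - reshard_overhead


# ~25% is the redis-cluster-ratelimiting saturation ratio quoted in this issue.
current = 0.25

# Hypothetical placeholders: NOT measured values.
ASSUMED_OVERHEAD = 0.20  # extra saturation caused by key migration
ASSUMED_CEILING = 0.90   # saturation we never want to exceed

peak = projected_saturation(current, ASSUMED_OVERHEAD)
limit = max_safe_saturation(ASSUMED_OVERHEAD, ASSUMED_CEILING)

print(f"projected peak during resharding: {peak:.2f}")
print(f"resharding fits while saturation <= {limit:.2f}")
```

Under this model, the point at which resharding stops being safe depends directly on the measured overhead, which is one more reason to revisit the benchmarks before picking a number.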
We should define a soft CPU saturation threshold to determine when we should start work on:
- deciding on the steps to perform scaling and resharding: do we want to use redis-cli's --cluster reshard, or control that process manually?
- assessing failure scenarios during resharding and how to handle them
- provisioning new nodes and performing resharding for the target cluster
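For the first item, this is roughly what the redis-cli route could look like if we scripted it. The sketch only builds the argument list for a non-interactive `redis-cli --cluster reshard` call and does not run anything. The host name and node IDs are hypothetical placeholders, authentication and TLS options are left out, and the flags should be double-checked against `redis-cli --cluster help` for the Redis version we run.

```python
from typing import List

TOTAL_HASH_SLOTS = 16384  # fixed number of hash slots in a Redis Cluster


def build_reshard_argv(entry_node: str, source_id: str,
                       target_id: str, slots: int) -> List[str]:
    """Build argv for a non-interactive `redis-cli --cluster reshard` run.

    entry_node: host:port of any node in the cluster.
    source_id / target_id: cluster node IDs (as shown by CLUSTER NODES).
    slots: number of hash slots to move from source to target.
    """
    if not 0 < slots <= TOTAL_HASH_SLOTS:
        raise ValueError(f"slots must be in 1..{TOTAL_HASH_SLOTS}, got {slots}")
    return [
        "redis-cli", "--cluster", "reshard", entry_node,
        "--cluster-from", source_id,
        "--cluster-to", target_id,
        "--cluster-slots", str(slots),
        "--cluster-yes",  # skip the interactive confirmation prompt
    ]


# Hypothetical placeholders, not real hosts or node IDs.
argv = build_reshard_argv(
    entry_node="redis-node-01.example:6379",
    source_id="<source-node-id>",
    target_id="<target-node-id>",
    slots=1024,
)
print(" ".join(argv))
```

The manual alternative would mean driving CLUSTER SETSLOT, CLUSTER GETKEYSINSLOT and MIGRATE ourselves. That gives finer control over pacing and failure handling, at the cost of owning more of the process.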
