Discussion: when should we scale up our Redis Clusters?
We have 4 Redis Clusters in production (redis-cluster-cache, redis-cluster-ratelimiting, redis-cluster-chat-cache, redis-cluster-feature-flag) and 1 more on the horizon (redis-cluster-persistent).
As detailed in &944 (closed) and &941 (closed), both epics aim to make the GitLab Rails application operate robustly during online resharding, which is an important step in scaling up a Redis Cluster.
In the early stages of exploring Redis Cluster's viability, I ran a few load-test/benchmark experiments to understand how a node's behavior changes during online resharding (#1965 (closed)). TL;DR: online resharding requires a source node to perform extra work to migrate keys to a target node. I'm uncertain whether that could push an almost-saturated cluster past saturation during the resharding window (making it a high-risk process), but it is something to consider. We may need to revisit the benchmarks.
Looking at the current CPU saturation ratios, redis-cluster-ratelimiting comes in highest with a ~25% saturation ratio.
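To make the headroom concern concrete, here is a minimal sketch of the arithmetic. It assumes resharding multiplies the source node's CPU usage by a fixed factor, which is a simplification the benchmarks would need to confirm. The overhead factor and hard limit below are hypothetical placeholders; only the ~25% figure comes from this discussion.

```python
def projected_saturation(current: float, overhead_factor: float) -> float:
    """CPU saturation of a source node while it is also migrating keys."""
    return current * overhead_factor


def max_safe_saturation(hard_limit: float, overhead_factor: float) -> float:
    """Highest steady-state saturation at which resharding stays under hard_limit."""
    return hard_limit / overhead_factor


CURRENT = 0.25         # redis-cluster-ratelimiting today (from this discussion)
OVERHEAD_FACTOR = 1.5  # ASSUMPTION: resharding adds 50% CPU on the source node
HARD_LIMIT = 0.90      # ASSUMPTION: saturation we never want to exceed

during = projected_saturation(CURRENT, OVERHEAD_FACTOR)
latest = max_safe_saturation(HARD_LIMIT, OVERHEAD_FACTOR)

print(f"projected saturation during resharding: {during:.1%}")
print(f"latest steady-state saturation at which to reshard: {latest:.1%}")
```

Under this model, the second number is an upper bound for any soft threshold: planning has to start well before it, because provisioning and resharding take time while traffic keeps growing.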
We should define a soft CPU saturation threshold to determine when we should start work on:
- deciding on the steps to perform scaling and resharding; do we want to use redis-cli's --cluster reshard or control that process manually?
- assessing failure scenarios during resharding and how to handle them
- provisioning new nodes and performing resharding for the target cluster