Discussion: when should we scale up our Redis Clusters?

We have 4 Redis Clusters in production (redis-cluster-cache, redis-cluster-ratelimiting, redis-cluster-chat-cache, redis-cluster-feature-flag) and 1 more on the horizon (redis-cluster-persistent).

&944 and &941 are now detailed. Both epics aim to make the GitLab Rails application operate robustly during an online resharding, which is an important step in scaling up a Redis Cluster.

In the early stages of exploring Redis Cluster's viability, I ran a few load-test/benchmark experiments to understand how a node's behavior changes during online resharding (#1965 (closed)). TL;DR: online resharding involves the source node performing extra work to migrate keys to a target node. I'm uncertain whether that could push an almost-saturated cluster past saturation during the resharding window (making it a high-risk process), but it is something to consider. We may need to revisit the benchmarks.
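To make the extra work on the source node concrete, here is a rough sketch of the command sequence for moving a single hash slot by hand, following the slot-migration procedure in the Redis Cluster documentation. The function name, node labels, and default batch size and timeout are hypothetical and only for illustration; the sketch builds the commands as strings and does not talk to a real cluster.

```python
from typing import List, Tuple


def plan_slot_migration(
    slot: int,
    source_id: str,
    target_id: str,
    target_host: str,
    target_port: int,
    batch_size: int = 100,   # hypothetical default: keys fetched per batch
    timeout_ms: int = 5000,  # hypothetical default: MIGRATE timeout
) -> List[Tuple[str, str]]:
    """Return (node, command) pairs for migrating one hash slot.

    Steps 3 and 4 are repeated on the source node until
    CLUSTER GETKEYSINSLOT returns no keys. That loop is the extra
    work the source node does during online resharding.
    """
    if not 0 <= slot <= 16383:
        raise ValueError("Redis Cluster hash slots range from 0 to 16383")
    return [
        # 1. Target starts accepting keys for the slot.
        ("target", f"CLUSTER SETSLOT {slot} IMPORTING {source_id}"),
        # 2. Source marks the slot as migrating.
        ("source", f"CLUSTER SETSLOT {slot} MIGRATING {target_id}"),
        # 3. Source lists a batch of keys still in the slot.
        ("source", f"CLUSTER GETKEYSINSLOT {slot} {batch_size}"),
        # 4. Source pushes that batch to the target.
        ("source",
         f'MIGRATE {target_host} {target_port} "" 0 {timeout_ms} '
         "KEYS <keys from step 3>"),
        # 5. Slot ownership is assigned to the target, target node first.
        ("target", f"CLUSTER SETSLOT {slot} NODE {target_id}"),
        ("source", f"CLUSTER SETSLOT {slot} NODE {target_id}"),
    ]
```

Each slot moved repeats steps 3 and 4 once per batch of keys, so the load added to the source node presumably grows with the number and size of keys being migrated; the benchmarks would need to confirm how much.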

Looking at the current CPU saturation ratios, redis-cluster-ratelimiting comes in highest with a ~25% saturation ratio.

source

We should define a soft CPU saturation threshold to determine when we should start work on:

deciding on the steps to perform scaling and resharding; do we want to use redis-cli's --cluster reshard or control that process manually?
assessing failure scenarios during resharding and how to handle them
provisioning new nodes and performing resharding for the target cluster
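As a starting point for discussion, the soft threshold check could look something like the sketch below. The threshold value of 0.60 is a placeholder I made up, not a proposal, and the function name is hypothetical; the real number should come out of this discussion and the resharding benchmarks.

```python
from typing import Dict, List

# Placeholder only: the actual soft threshold is what this issue
# is trying to decide.
SOFT_CPU_SATURATION_THRESHOLD = 0.60


def clusters_needing_scale_up_planning(
    saturation: Dict[str, float],
    soft_threshold: float = SOFT_CPU_SATURATION_THRESHOLD,
) -> List[str]:
    """Return cluster names at or above the soft threshold.

    `saturation` maps a cluster name to its CPU saturation ratio
    in the range 0.0 to 1.0. Results are ordered from most to
    least saturated.
    """
    if not 0 < soft_threshold <= 1:
        raise ValueError("soft_threshold must be in (0, 1]")
    flagged = [
        name for name, ratio in saturation.items()
        if ratio >= soft_threshold
    ]
    return sorted(flagged, key=lambda name: saturation[name], reverse=True)
```

With the current ~25% ratio for redis-cluster-ratelimiting, `clusters_needing_scale_up_planning({"redis-cluster-ratelimiting": 0.25})` returns an empty list under the placeholder threshold. The threshold should leave enough headroom to absorb the migration overhead on the source node during the resharding window.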

Edited Aug 10, 2023 by Sylvester Chin