
Discussion: when should we scale up our Redis Clusters?

We have 4 Redis Clusters in production (redis-cluster-cache, redis-cluster-ratelimiting, redis-cluster-chat-cache, redis-cluster-feature-flag) and 1 more on the horizon (redis-cluster-persistent).

Epics &944 and &941 are now detailed. Both aim to make the GitLab Rails application operate robustly during online resharding, which is an important step in scaling up a Redis Cluster.

In the early stages of exploring Redis Cluster's viability, I ran a few load-test/benchmark experiments to understand how a node's behavior changes during online resharding (#1965 (closed)). TL;DR: online resharding requires the source node to perform extra work to migrate keys to a target node. I'm uncertain whether that could push an almost-saturated cluster past saturation during the resharding window (making it a high-risk process), but it is something to consider. We may need to revisit the benchmarks.
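To make the concern concrete, here is a rough sketch of the headroom check. It assumes the migration cost can be modelled as a flat addition to the source node's CPU saturation ratio, which is a simplification and not something the benchmarks in #1965 established. The function names, the 0.30 overhead and the 0.90 ceiling are all placeholders for illustration.

```python
def projected_saturation(baseline: float, migration_overhead: float) -> float:
    """Estimate a source node's CPU saturation ratio while it migrates keys.

    Assumption: overhead is additive. Real overhead would have to come
    from revisited benchmarks.
    """
    return baseline + migration_overhead


def is_reshard_risky(baseline: float, migration_overhead: float,
                     ceiling: float = 0.90) -> bool:
    """True if the projected ratio reaches the (hypothetical) ceiling."""
    return projected_saturation(baseline, migration_overhead) >= ceiling


# 0.25 is the ratio quoted for redis-cluster-ratelimiting in this issue;
# 0.30 is a made-up overhead, not a measured one.
print(is_reshard_risky(0.25, 0.30))  # False
print(is_reshard_risky(0.80, 0.30))  # True
```

If the additive model holds even roughly, the highest baseline at which resharding is still comfortable is about `ceiling - migration_overhead`, so the measured overhead is the number worth pinning down first.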

Looking at the current CPU saturation ratios, redis-cluster-ratelimiting comes in highest with a ~25% saturation ratio.

Screenshot_2023-08-10_at_10.56.44_AM

source

We should define a soft CPU saturation threshold to determine when we should start work on:

  • deciding on the steps for scaling and resharding: do we want to use redis-cli's --cluster reshard or control that process manually?
  • assessing failure scenarios during resharding and how to handle them
  • provisioning new nodes and performing resharding for the target cluster
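
For the "control that process manually" option, the sketch below lists the commands for moving a single hash slot, in the order the Redis Cluster documentation describes; `redis-cli --cluster reshard` automates roughly the same sequence. It only builds the plan and does not talk to Redis. The function name, node labels and sample values are hypothetical, and the `keys` argument stands in for what repeated `CLUSTER GETKEYSINSLOT` calls on the source would return.

```python
from typing import List, Tuple

# (node the command is sent to, command tokens)
Command = Tuple[str, List[str]]


def plan_slot_migration(slot: int, source_id: str, target_id: str,
                        target_host: str, target_port: int,
                        keys: List[str], timeout_ms: int = 5000) -> List[Command]:
    """Build the command sequence for moving one slot from source to target."""
    s = str(slot)
    plan: List[Command] = [
        # 1. Target starts accepting the slot, source marks it as leaving.
        ("target", ["CLUSTER", "SETSLOT", s, "IMPORTING", source_id]),
        ("source", ["CLUSTER", "SETSLOT", s, "MIGRATING", target_id]),
    ]
    if keys:
        # 2. Source pushes the keys; this is the extra work done by the source.
        plan.append(("source", ["MIGRATE", target_host, str(target_port),
                                "", "0", str(timeout_ms), "KEYS", *keys]))
    # 3. Assign the slot to the target: target first, then source.
    plan.append(("target", ["CLUSTER", "SETSLOT", s, "NODE", target_id]))
    plan.append(("source", ["CLUSTER", "SETSLOT", s, "NODE", target_id]))
    return plan


for node, cmd in plan_slot_migration(42, "src-id", "tgt-id",
                                     "10.0.0.2", 6379, ["a", "b"]):
    print(node, " ".join(cmd))
```

Writing the steps out this way also gives a checklist for the failure-scenario work: each step is a point where an interruption leaves the slot in a MIGRATING/IMPORTING state that needs handling.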
Edited by Sylvester Chin