Redis Operability Guidelines

The purpose of this issue is to collaborate on the operational requirements, and potential future state, of our Redis infrastructure that runs on GitLab's SaaS platforms.

The Scalability group (group::scalability) is currently working on two streams:

Introducing Redis Cluster &823 (closed)
Functional Partitioning &857 (closed)

The primary driver of the partitioning effort is to buy headroom in redis-cache, which is currently constrained to running on a single master. However, partitioning also brings the benefits of isolation (smaller blast radius etc.), perhaps at the cost of a larger infrastructure (compute costs, operational complexity etc.).

Introducing Redis Cluster makes headroom less of a constraint, because we can horizontally scale usage across multiple masters. This flexibility gives us more options for how we scale our Redis workloads, which raises the following questions:

Will we run "one big" cluster?
Will partitioned instances remain partitioned, or move to the main cluster?
If we favour isolation, what are the conditions for further partitioning redis-cache, even when it runs on Redis Cluster?
What are the drivers for deciding to move an instance from Sentinel to Cluster? Saturation is likely the primary consideration, but once an instance is approaching saturation, how do we decide whether to partition it or move it to the (or a) cluster?
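
For context on the "multiple masters" point above, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 hash slots (CRC16 of the key, modulo 16384, with hash tags honoured). The function names `key_hash_slot` and `slot_owner` are illustrative only, and the even, contiguous slot split in `slot_owner` is an assumption made for the example, not a description of our topology.

```python
import binascii

HASH_SLOTS = 16384  # fixed number of hash slots in Redis Cluster


def key_hash_slot(key: bytes) -> int:
    """Return the Redis Cluster hash slot for a key."""
    # Hash tags: if the key contains "{...}" with a non-empty body,
    # only that body is hashed, so related keys can share a slot.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    # binascii.crc_hqx with initial value 0 computes CRC16-CCITT (XMODEM),
    # the variant named in the Redis Cluster specification.
    return binascii.crc_hqx(key, 0) % HASH_SLOTS


def slot_owner(slot: int, masters: int) -> int:
    """Hypothetical: index of the master owning `slot`, assuming the slot
    space is split into equal contiguous ranges across `masters` nodes."""
    return slot * masters // HASH_SLOTS


if __name__ == "__main__":
    for k in (b"{user1000}.following", b"{user1000}.followers"):
        slot = key_hash_slot(k)
        print(k.decode(), "-> slot", slot, "-> master", slot_owner(slot, 3))
```

Because both example keys share the hash tag `user1000`, they land in the same slot and therefore on the same master. This is also why a single very hot key or hash tag can still saturate one master even on a cluster, which is relevant to the isolation question above.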

The questions above are unlikely to have simple answers, as there are multiple considerations to weigh: running costs, operational complexity, the ability for stage groups to self-serve etc. Further, we won't be able to answer them confidently until we gain more experience with Redis Cluster. However, having this conversation early can help set direction, create team alignment, and hopefully optimize design decisions.

Edited Feb 01, 2023 by Liam McAndrew