Discussion: Paved roads for Redis workload migration

This issue summarises the state of Redis workload migration and discusses the need for paved roads to reduce the toil of such migrations. It contributes to the overarching discussion in #2155.

Problem

In the past year, we have migrated a number of workloads of varying complexity. Just to name a few:

feature flag migration out of Cache (we could not use MultiStore for this since there is a recursive dependency)
direct migration from Redis Sentinel to Redis Cluster (rate-limiting, cache, shared state, repository cache)
functional sharding of repository-cache, db-load-balancing, workhorse and actioncable

These migrations range from simple to very complex.

Simple: using MultiStore, where we dual-write and switch read traffic over to the new Redis after waiting long enough for keys written before the migration started to expire.
Complex: using MultiStore alone is insufficient because the workload will never converge, either due to the data type (counters or lists) or due to unrealistically long TTLs (1-2 weeks). Such migrations necessitate an external script to sync up the data.
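To illustrate the difference, here is a minimal Python sketch of the dual-write idea. It is not the actual MultiStore implementation: `FakeRedis`, the `MultiStore` class below, the `read_from_new` flag and the key names are all hypothetical stand-ins, and time is passed in explicitly so the example is deterministic.

```python
import time


class FakeRedis:
    """In-memory stand-in for a Redis instance (illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, None if ttl is None else now + ttl)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and now >= expires_at:
            del self._data[key]  # lazily expire the key
            return None
        return value

    def incr(self, key):
        value = (self.get(key) or 0) + 1
        self._data[key] = (value, None)  # counters here carry no TTL
        return value


class MultiStore:
    """Dual-writes to both stores; a flag decides which one serves reads."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.read_from_new = False  # plays the role of the feature flag

    def set(self, key, value, ttl=None, now=None):
        self.old.set(key, value, ttl=ttl, now=now)
        self.new.set(key, value, ttl=ttl, now=now)

    def get(self, key, now=None):
        store = self.new if self.read_from_new else self.old
        return store.get(key, now=now)

    def incr(self, key):
        old_value = self.old.incr(key)
        new_value = self.new.incr(key)
        return new_value if self.read_from_new else old_value


old, new = FakeRedis(), FakeRedis()

# Written before the migration: exists only on the old instance.
old.set("cache:a", "v1", ttl=60, now=0)
old.incr("counter:hits")

store = MultiStore(old, new)
store.set("cache:b", "v2", ttl=60, now=10)  # dual-written
store.incr("counter:hits")

# Once the pre-migration TTL has elapsed, the TTL'd keys agree on both sides:
# cache:a has expired everywhere and cache:b exists on both instances.
converged_cache = all(
    old.get(k, now=61) == new.get(k, now=61) for k in ("cache:a", "cache:b")
)

# The counter does not: the new instance never saw the pre-migration increment.
counter_gap = old.get("counter:hits") - new.get("counter:hits")
```

In this sketch the cache keys converge purely by waiting, while the counter stays permanently behind on the new instance, which is the case where an external sync script is needed.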

How we have been doing migrations

Ignoring provisioning work, a typical Redis workload migration entails the following:

1. Make an application-side change to incorporate MultiStore usage. This usually means adding a temporary Wrapper class that connects to the target instance.
2. Update chef and k8s-workloads to configure the temporary Wrapper class.
3. Perform the migration as part of a change request by enabling the feature flags.
4. Run the external migration and validation script if the workload requires it.
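As a rough illustration of the last step, the sketch below shows the general shape of a sync-and-validate pass. It is an assumption-laden simplification rather than any of the scripts actually used: plain dictionaries stand in for the two Redis instances, `sync` and `validate` are hypothetical names, and it ignores TTLs and writes that race with the copy. It treats the old instance as authoritative, which holds only while reads have not yet been switched over.

```python
import copy


def sync(source, target):
    """Copy every key whose value is missing or different on the target."""
    repaired = []
    for key, value in source.items():
        if target.get(key) != value:
            target[key] = copy.deepcopy(value)
            repaired.append(key)
    return sorted(repaired)


def validate(source, target):
    """Return the keys that still differ; an empty list means converged."""
    return sorted(key for key, value in source.items() if target.get(key) != value)


# Stand-ins for the old and new instances part-way through a migration:
# the counter is behind and the list was never dual-written.
old_instance = {"counter:hits": 2, "list:jobs": ["a", "b"], "cache:b": "v2"}
new_instance = {"counter:hits": 1, "cache:b": "v2"}

before = validate(old_instance, new_instance)
repaired = sync(old_instance, new_instance)
after = validate(old_instance, new_instance)
```

A real script would have to iterate the keyspace incrementally, preserve TTLs, and cope with concurrent writes, which is a large part of why complex migrations need domain expertise.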

Following a successful migration, we need to clean up: step 1 is reverted and the unused configuration is removed.

See &1236 (closed) for an example of the work required.

Step 2 seems unavoidable but step 1 could be avoided if there is a way to configure the migration class directly.

Considerations

Note: some of these points do not contribute directly to Scalability's goal of scaling SaaS platforms, but they are worth noting.

Dedicated: Dedicated currently runs 1 Redis instance per tenant. Realistically, large tenants may need to separate cache workloads from persistent ones once they scale to a certain point.

Cells: Similar to the above. We have not settled on the Redis configuration for Cells, but it is unlikely to be as sharded (1 instance per logical wrapper class) as .com.

SM users: If SM users wish to migrate workloads, they could opt for a deployment with some downtime. I'll not go into more detail since this is out of scope for SaaS platforms.

.com: we have 2 workloads which could be due for Redis Cluster migration (sessions and db load balancing) and 1 workload which could be due for further functional sharding (workhorse and actioncable since they share the same Redis instance right now).

Proposal

I propose that we improve simple workload migrations to the point where they can be self-served while providing assistance for complex migrations as domain experts.

Some ideas could be:

Move towards using only redis.yml for .com. This will simplify the configuration process and reduce confusion over whether to use redis.xxx.yml or redis.yml.
Enable MultiStore-based migration in the wrapper class using redis.yml. The advantage is that future workload migrations would not require a merge request. This means any user (Cells/Dedicated/SM) will be able to migrate using a MultiStore.
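One way to picture the second idea is the Python sketch below, where a wrapper decides between a single store and a MultiStore purely from configuration. Everything in it is hypothetical: the `migration` and `target_url` keys are not part of the current redis.yml format, the hostnames are made up, and a dictionary stands in for the parsed file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreConfig:
    url: str
    migration_target_url: Optional[str] = None  # hypothetical setting


def load_store_configs(parsed_yaml):
    """Turn the parsed redis.yml mapping into one StoreConfig per workload."""
    configs = {}
    for workload, settings in parsed_yaml.items():
        migration = settings.get("migration") or {}
        configs[workload] = StoreConfig(
            url=settings["url"],
            migration_target_url=migration.get("target_url"),
        )
    return configs


def describe(config):
    """Report which kind of store a wrapper class would build."""
    if config.migration_target_url is None:
        return f"single store -> {config.url}"
    return f"multistore -> {config.url} + {config.migration_target_url}"


# Stand-in for the result of parsing redis.yml (hostnames are made up).
parsed = {
    "cache": {"url": "redis://cache.example:6379"},
    "sessions": {
        "url": "redis://sessions.example:6379",
        "migration": {"target_url": "redis://sessions-cluster.example:6379"},
    },
}

configs = load_store_configs(parsed)
summary = {name: describe(cfg) for name, cfg in configs.items()}
```

With a shape like this, starting a migration becomes a configuration change for the workload being moved, and cleaning up afterwards means removing the migration entry rather than reverting application code.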

Edited Apr 09, 2024 by Sylvester Chin