Identify, detect and handle failures during Redis Cluster scaling and resharding

This issue is for discussing failure modes during Redis Cluster scaling and resharding.

Possible failure modes

Redis server CPU saturation due to the extra work of migrating keys
Web services and/or Sidekiq service Apdex degraded during resharding because of client redirections
Nodes crashing leading to irrecoverable/difficult-to-recover states during slot migration (https://github.com/redis/redis/issues/6339)
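
To make the last failure mode concrete: a slot left mid-migration shows up in `CLUSTER NODES` output as a bracketed entry, `[<slot>->-<node-id>]` for a migrating slot and `[<slot>-<-<node-id>]` for an importing one. The following is a minimal sketch, not a finished tool, of scanning that output for such open slots. The function name `find_open_slots` and the sample node IDs are made up for illustration. Each node only reports its own open slots, so the output would have to be collected from every node in the cluster.

```python
import re

# Bracketed entries that CLUSTER NODES appends to a node's slot list while
# a slot is mid-migration:
#   [<slot>->-<node-id>]  slot is MIGRATING to <node-id>
#   [<slot>-<-<node-id>]  slot is IMPORTING from <node-id>
OPEN_SLOT = re.compile(r"^\[(\d+)-([<>])-([0-9a-f]+)\]$")


def find_open_slots(cluster_nodes_output):
    """Return (slot, state, peer_node_id) tuples found in CLUSTER NODES output."""
    open_slots = []
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        # Fields: id, addr, flags, master, ping-sent, pong-recv,
        # config-epoch, link-state, then zero or more slot entries.
        for entry in fields[8:]:
            match = OPEN_SLOT.match(entry)
            if match:
                slot, direction, peer = match.groups()
                state = "migrating" if direction == ">" else "importing"
                open_slots.append((int(slot), state, peer))
    return open_slots


# Illustrative input only; the node IDs and addresses are invented.
sample = (
    "e7d1eecce10fd6bb5eb35b9f99a514335d9ba9ca 127.0.0.1:30001@31001 "
    "myself,master - 0 0 1 connected 0-5460 "
    "[93->-292f8b365bb7edb5e285caf0b7e6ddc7265d2f4f] "
    "[77-<-67ed2db8d677e59ec4a4cefb06858cf2a1a89fa1]\n"
)
for slot, state, peer in find_open_slots(sample):
    print(slot, state, peer)
```

A slot that stays in this list after the migration tooling has exited is a candidate for the "fix slot states" handling discussed further down.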

Detection

We already have metrics and alerts to track:

SLIs
Redis service's component saturation (CPU, memory)
Rates of redirection (MOVED vs ASK)
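
As an illustration of the MOVED vs ASK split: Redis Cluster replies `ASK <slot> <host>:<port>` while a slot is still being migrated, and `MOVED <slot> <host>:<port>` once the slot has a new owner and the client's slot map is stale. The sketch below shows one way those error replies could be classified and turned into rates. The helper names `classify_redirect` and `redirect_rates` are hypothetical and are not part of our existing metrics pipeline.

```python
from collections import Counter


def classify_redirect(error_message):
    """Return 'MOVED', 'ASK', or None for a Redis error reply string."""
    parts = error_message.split()
    # Redirection errors look like "MOVED <slot> <host>:<port>" or
    # "ASK <slot> <host>:<port>".
    if len(parts) == 3 and parts[0] in ("MOVED", "ASK") and parts[1].isdigit():
        return parts[0]
    return None


def redirect_rates(error_messages, total_commands):
    """Fraction of all commands that were redirected, split by kind."""
    counts = Counter()
    for message in error_messages:
        kind = classify_redirect(message)
        if kind:
            counts[kind] += 1
    if total_commands <= 0:
        return {"MOVED": 0.0, "ASK": 0.0}
    return {kind: counts[kind] / total_commands for kind in ("MOVED", "ASK")}
```

Under this reading, a sustained ASK rate points at a migration in progress (or stuck), while a MOVED spike points at clients that have not yet refreshed their slot map.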

Handling failures

Pausing resharding
Reverting resharded key slots, effectively undoing the migration either partially or fully
Adjusting rate of key slot migration (most likely slowing down)
Fixing slot states (TODO: we need a playbook for how to do this)
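
The pausing and rate-adjustment options above could be sketched as a migration loop that checks a pause signal and sleeps between slots. This is only a sketch under assumptions: `migrate_one` and `should_pause` are hypothetical callables standing in for whatever tooling actually moves a slot and whatever signal operators use to pause, and the function does not talk to Redis itself.

```python
import time


def migrate_slots(slots, migrate_one, should_pause, delay_seconds=1.0,
                  sleep=time.sleep):
    """Migrate slots one at a time, stopping cleanly when asked to pause.

    migrate_one(slot): hypothetical callable that fully migrates one slot.
    should_pause(): hypothetical callable returning True to stop the run.
    Returns (done, remaining) so the run can be resumed or reverted later.
    """
    done = []
    remaining = list(slots)
    while remaining:
        # Only check for a pause at slot boundaries, so that pausing
        # never leaves a slot half-migrated.
        if should_pause():
            break
        slot = remaining.pop(0)
        migrate_one(slot)
        done.append(slot)
        if remaining:
            sleep(delay_seconds)  # throttle: raise this to slow migration down
    return done, remaining
```

Returning both lists is a deliberate choice in this sketch: `remaining` is what a resumed run would pick up, and `done` is the set of slots a partial or full revert would need to move back.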

Edited Jan 30, 2024 by Sylvester Chin