Identify, detect and handle failures during Redis Cluster scaling and resharding
This issue is for discussing the failure modes we may hit during Redis Cluster scaling and resharding, how to detect them, and how to handle them:
Possible failure modes
- Redis server CPU saturation due to extra work of migrating keys
- Web service and/or Sidekiq Apdex degraded during resharding, because clients have to follow redirections
- Nodes crashing during slot migration, leaving the cluster in an unrecoverable or difficult-to-recover state (https://github.com/redis/redis/issues/6339)
Detection
We already have metrics and alerts that track:
- SLIs
- Redis service component saturation (CPU, memory)
- Rates of redirection (MOVED vs ASK)
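
For reference, the MOVED vs ASK split matters because a MOVED reply means the slot has permanently changed owner (the client's slot map is stale), while an ASK reply means the slot is in the middle of a migration. The sketch below shows one way the two could be told apart and counted from raw error reply strings. It is only an illustration: the function names `classify_redirection` and `count_redirections` are hypothetical and are not part of our existing metrics pipeline.

```python
from collections import Counter


def classify_redirection(error_message):
    """Return 'MOVED', 'ASK', or None for a Redis error reply string.

    Redirection replies look like 'MOVED 3999 127.0.0.1:6381' or
    'ASK 3999 127.0.0.1:6381' (optionally with the leading '-' of the
    wire protocol).
    """
    parts = error_message.lstrip("-").split()
    if len(parts) == 3 and parts[0] in ("MOVED", "ASK") and parts[1].isdigit():
        return parts[0]
    return None


def count_redirections(error_messages):
    """Count MOVED and ASK replies, ignoring every other error."""
    counts = Counter({"MOVED": 0, "ASK": 0})
    for message in error_messages:
        kind = classify_redirection(message)
        if kind is not None:
            counts[kind] += 1
    return counts
```

A sustained rise in ASK replies would be expected while a slot is being migrated; a rise in MOVED replies that does not settle after a migration finishes would suggest that clients are not refreshing their slot maps.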
Handling failures
- Pausing resharding
- Reverting resharded key slots, effectively undoing the migration either partially or fully
- Adjusting the rate of key slot migration (most likely slowing it down)
- Fixing slot states (TODO: we need a playbook for how to do this)
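
As a possible starting point for the slot-state playbook, the sketch below finds slots that are still marked as migrating or importing by parsing the text output of `CLUSTER NODES`, where such slots appear as `[slot->-node_id]` (migrating) and `[slot-<-node_id]` (importing). The function name `find_open_slots` and the shortened node IDs in the usage example are assumptions made for illustration; this is not an existing tool.

```python
import re

# Matches the open-slot markers in CLUSTER NODES output:
#   [77->-<node id>]  slot 77 is migrating to that node
#   [77-<-<node id>]  slot 77 is being imported from that node
OPEN_SLOT = re.compile(r"^\[(\d+)-([<>])-([0-9a-f]+)\]$")


def find_open_slots(cluster_nodes_output):
    """Return (slot, state, node_id, peer_id) tuples for every open slot."""
    open_slots = []
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        # The first 8 fields describe the node; slot entries start at index 8.
        if len(fields) < 9:
            continue
        node_id = fields[0]
        for token in fields[8:]:
            match = OPEN_SLOT.match(token)
            if match:
                slot, direction, peer_id = match.groups()
                state = "migrating" if direction == ">" else "importing"
                open_slots.append((int(slot), state, node_id, peer_id))
    return open_slots
```

For example, given output in which slot 77 is open on both sides of a migration:

```python
sample = (
    "e7d1eecc 127.0.0.1:30001@31001 myself,master - 0 0 1 connected 0-5460 [77->-292f8b36]\n"
    "292f8b36 127.0.0.1:30003@31003 master - 0 1426238318243 3 connected 10923-16383 [77-<-e7d1eecc]\n"
)
for slot, state, node_id, peer_id in find_open_slots(sample):
    print(slot, state, node_id, peer_id)
```

If a node crashed mid-migration, only one side of the pair may show up, which is the kind of state the playbook would need to cover. Depending on where the keys ended up, an open slot might then be cleared with `CLUSTER SETSLOT <slot> STABLE` or repaired with `redis-cli --cluster fix`, but the exact procedure still needs to be worked out.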
Edited by Sylvester Chin