Detect masterless shard in Redis Cluster and promote replica if viable

Redis Cluster natively attempts to detect a failed master node and promote one of its surviving replicas.

However, under certain conditions, the shard may have a surviving replica that the cluster cannot promote (e.g. due to lack of quorum among voting master nodes).

As a plausible concrete example: if a zonal failure takes out a zone that held a majority of the masters, Redis Cluster will not auto-promote the surviving replicas of those shards, even though a majority of nodes remain available in the surviving zones. This is due to Cluster V1's design choice to allow only master nodes to vote in failover elections. Refusing to fail over without quorum avoids split brain but sacrifices availability.
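The quorum arithmetic behind this example can be sketched as follows. This is a minimal illustration, assuming quorum means a strict majority of all masters; the zone names and the 6-shard layout are hypothetical and not taken from any real cluster.

```python
from __future__ import annotations


def quorum(total_masters: int) -> int:
    """Votes needed to authorize a failover: a strict majority of all masters."""
    return total_masters // 2 + 1


def can_auto_failover(masters_per_zone: dict[str, int], failed_zone: str) -> bool:
    """True if the masters outside the failed zone can still form a quorum."""
    total = sum(masters_per_zone.values())
    surviving = total - masters_per_zone.get(failed_zone, 0)
    return surviving >= quorum(total)


# Hypothetical 6-shard cluster whose masters have drifted into zone-a.
layout = {"zone-a": 4, "zone-b": 1, "zone-c": 1}
print(can_auto_failover(layout, "zone-a"))  # False: 2 surviving masters < quorum of 4
print(can_auto_failover(layout, "zone-b"))  # True: 5 surviving masters >= quorum of 4
```

Note that replicas do not appear in the calculation at all: losing zone-a leaves 12 of 18 nodes alive in this layout, yet only 2 of the 6 voters.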

Redis Cluster lacks zone/rack awareness, but we can compensate for that using a fixed topology where each shard has 3 nodes (1 master + 2 replicas) statically spread among 3 separate zones. The remaining concern is that too many masters may accumulate in a single zone.

As a mitigation that favors availability, we can detect when a shard lacks a master and manually promote a replica if Redis has not done so itself within a configurable timeout.
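One way the detection step might look is sketched below, assuming the input is the text returned by `CLUSTER NODES` from a surviving node. The helper names and the sample output are hypothetical. The sketch treats both `fail` and `fail?` as unhealthy, on the assumption that without a majority of masters a dead node may never progress beyond `fail?` in the survivors' view.

```python
from __future__ import annotations

from dataclasses import dataclass

UNHEALTHY = {"fail", "fail?"}


@dataclass
class Node:
    node_id: str
    flags: set[str]
    master_id: str  # "-" when the node is itself a master


def parse_cluster_nodes(text: str) -> list[Node]:
    """Parse the id, flags, and master-id fields of CLUSTER NODES output."""
    nodes = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 8:  # skip malformed or truncated lines
            continue
        nodes.append(Node(parts[0], set(parts[2].split(",")), parts[3]))
    return nodes


def promotable_replicas(nodes: list[Node]) -> dict[str, list[str]]:
    """Map each unhealthy master's id to the ids of its healthy-looking replicas."""
    down = {n.node_id for n in nodes if "master" in n.flags and n.flags & UNHEALTHY}
    result: dict[str, list[str]] = {m: [] for m in down}
    for n in nodes:
        if "slave" in n.flags and n.master_id in down and not (n.flags & UNHEALTHY):
            result[n.master_id].append(n.node_id)
    return result


# Hypothetical output: master "aaa" is unreachable, both of its replicas are fine.
SAMPLE = """\
aaa 10.0.0.1:6379@16379 master,fail? - 0 0 1 disconnected 0-5460
bbb 10.0.1.1:6379@16379 slave aaa 0 0 1 connected
ccc 10.0.2.1:6379@16379 myself,slave aaa 0 0 1 connected
ddd 10.0.1.2:6379@16379 master - 0 0 2 connected 5461-10922
eee 10.0.0.2:6379@16379 slave,fail? ddd 0 0 2 disconnected
"""
print(promotable_replicas(parse_cluster_nodes(SAMPLE)))  # {'aaa': ['bbb', 'ccc']}
```

For the promotion itself, one candidate is sending `CLUSTER FAILOVER TAKEOVER` to the chosen replica, since that variant does not require authorization from a majority of masters. Whether that is the mechanism intended here is an assumption.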

The same outcome (a masterless shard with healthy but unpromotable replicas) can also occur in other ways, such as a majority of masters concurrently restarting or becoming unreachable from the rest of the cluster (e.g. network partition). In any such circumstance, if a shard is masterless and at least one of its replicas remains healthy, we want to promote that replica for the sake of cluster availability. To avoid racing with Redis, we will delay the forced promotion long enough to let Redis's native quorum-based promotion mechanism act if it can; we only want to step in when Redis cannot self-heal.
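The delay logic could be sketched roughly as below. This is an illustration under assumptions, not a proposed implementation: `MasterlessTracker`, its method name, and the shard and replica identifiers are all hypothetical, and picking the first healthy replica is a placeholder for whatever selection rule is eventually chosen.

```python
from __future__ import annotations

import time


class MasterlessTracker:
    """Remember when each shard was first seen masterless and report the
    shards that have stayed masterless for at least the grace period."""

    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds
        self._first_seen: dict[str, float] = {}

    def due_for_promotion(
        self, masterless: dict[str, list[str]], now: float | None = None
    ) -> dict[str, str]:
        """masterless maps shard id -> healthy replica ids; returns shard -> replica to promote."""
        now = time.monotonic() if now is None else now
        # A shard that regained a master (e.g. Redis self-healed) resets its timer.
        for shard in list(self._first_seen):
            if shard not in masterless:
                del self._first_seen[shard]
        due = {}
        for shard, replicas in masterless.items():
            first = self._first_seen.setdefault(shard, now)
            # Only act when the grace period has elapsed and a healthy replica exists.
            if replicas and now - first >= self.grace_seconds:
                due[shard] = replicas[0]  # placeholder selection rule
        return due


tracker = MasterlessTracker(grace_seconds=60)
print(tracker.due_for_promotion({"shard-1": ["replica-b"]}, now=0))   # {}
print(tracker.due_for_promotion({"shard-1": ["replica-b"]}, now=61))  # {'shard-1': 'replica-b'}
print(tracker.due_for_promotion({}, now=62))                          # {}
```

The grace period would presumably need to be comfortably longer than the time Redis itself needs to detect the failure and run an election, which depends on `cluster-node-timeout`.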

Edited Mar 02, 2023 by Matt Smiley