Redis failover inhibited by residual phantom sentinel voters

Problem: Sentinels disagree about how many votes constitute a majority

Normally we run 3 sentinels for each redis cluster, and we expect any 2 of them to be able to trigger a redis failover, following the rule that a majority vote avoids the risk of long-lived split-brain.

However, we found that losing 1 sentinel (or the host it runs on) can prevent the surviving sentinels from reaching the majority required to authorize a redis failover.

This fragile state arises because the sentinels disagree about the number of voters and, consequently, about the threshold for reaching a majority.

Under normal conditions, we expect 3 sentinels to be able to vote. But currently our 3 sentinels respectively think that there are 3, 4, or 5 voters:

msmiley@saoirse:~$ mussh -m -b -h redis-{01..03}-db-gprd.c.gitlab-production.internal -c '/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gprd-redis | grep -A1 "num-other-sentinels"'
redis-01-db-gprd.c.gitlab-production.internal: num-other-sentinels
redis-01-db-gprd.c.gitlab-production.internal: 4
redis-02-db-gprd.c.gitlab-production.internal: num-other-sentinels
redis-02-db-gprd.c.gitlab-production.internal: 3
redis-03-db-gprd.c.gitlab-production.internal: num-other-sentinels
redis-03-db-gprd.c.gitlab-production.internal: 2

Consequently, 2 of our 3 sentinels believe a minimum of 3 votes is needed to reach a majority (note that num-other-sentinels excludes the local sentinel, so the reported values of 4, 3, and 2 correspond to perceived voter pools of 5, 4, and 3). That inaccuracy means that failing a random 1 of the 3 redis hosts has a 67% chance of preventing the sentinels from performing a failover if one were needed. The split-brain protection afforded by requiring a majority to authorize failover is now overly cautious and would refuse to fail over even if 2 of the 3 real voters were in agreement.
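The mismatch can be made concrete with a little arithmetic. This sketch (plain Python, no connection to redis) derives each sentinel's majority threshold from the num-other-sentinels values reported above; the short host labels are just stand-ins for the full hostnames in the transcript.

```python
# Each sentinel counts itself plus num-other-sentinels as the voter pool,
# and requires a strict majority of that pool to authorize a failover:
# majority = floor(voters / 2) + 1.

def majority_threshold(num_other_sentinels: int) -> int:
    """Votes needed for a majority, as seen by one sentinel."""
    voters = num_other_sentinels + 1  # peers plus the local sentinel
    return voters // 2 + 1

# Values reported by our three sentinels (see the redis-cli output above).
reported = {"redis-01": 4, "redis-02": 3, "redis-03": 2}

for host, num_other in reported.items():
    print(f"{host}: {num_other + 1} voters -> needs "
          f"{majority_threshold(num_other)} votes")
```

With only 3 real voters, any sentinel whose threshold works out to 3 is one host failure away from being unable to authorize a failover.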

Corrective actions

We need to add detection and prevention for this condition.

Detection: Metric and alerting

For detection, we could add a prometheus check that connects to redis sentinel (port 26379) and runs:

SENTINEL MASTER <master_group_name>

or

SENTINEL MASTERS

Then extract the num-other-sentinels field from its output, and alert if that field is anything other than 2 for more than a few minutes:

  • Being greater than 2 may mean dead phantom sentinels are still considered voters. (This is the current normal state as of today that we want to fix and prevent.)
  • Being less than 2 may mean that we have unexpectedly lost a sentinel and then explicitly told the remaining sentinels to forget about inactive peer sentinels.
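As a sketch of that detection logic: SENTINEL MASTER replies with a flat list of alternating field names and values, and the check only needs to pull out num-other-sentinels and classify it. The function names and expected-count constant below are illustrative assumptions, not an existing exporter.

```python
# Hypothetical classification logic for a prometheus check.
# SENTINEL MASTER returns a flat [field, value, field, value, ...] reply;
# we extract num-other-sentinels and compare it to the expected peer count.

EXPECTED_OTHER_SENTINELS = 2  # 3 sentinels total, minus self

def parse_sentinel_master(reply: list) -> dict:
    """Turn the flat field/value reply into a dict."""
    return dict(zip(reply[::2], reply[1::2]))

def check_voter_count(reply: list) -> str:
    n = int(parse_sentinel_master(reply)["num-other-sentinels"])
    if n > EXPECTED_OTHER_SENTINELS:
        return "phantom-voters"    # dead sentinels still on the roster
    if n < EXPECTED_OTHER_SENTINELS:
        return "missing-sentinel"  # a live peer is gone or was forgotten
    return "ok"

# Example reply fragment, shaped like the redis-cli output above:
reply = ["name", "gprd-redis", "num-other-sentinels", "4"]
print(check_voter_count(reply))  # -> phantom-voters
```

The "phantom-voters" and "missing-sentinel" outcomes map directly to the two alert conditions above.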

Prevention

I suspect our maintenance procedure inadvertently induces this condition, and that we could avoid it by adding a step to remove the phantom sentinels from the voting roster.

To tell a sentinel to forget about all its historical peers and rediscover only the currently live peers, we can run:

SENTINEL RESET *

See this section of the Redis Sentinel docs: https://redis.io/topics/sentinel#adding-or-removing-sentinels
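A remediation step along those lines might look like the sketch below, which uses the redis-py client (an assumption; the equivalent redis-cli invocation works too) and resets one sentinel at a time, pausing between instances as the docs recommend. The host list, master pattern, and function names are placeholders.

```python
# Hypothetical maintenance step: after replacing a sentinel host, run
# SENTINEL RESET * on each surviving sentinel so it forgets phantom
# peers and rediscovers only the currently live ones.

import time

SENTINEL_HOSTS = ["redis-01", "redis-02", "redis-03"]  # placeholder names
SENTINEL_PORT = 26379

def reset_needed(num_other_sentinels: int, expected: int = 2) -> bool:
    """Reset only when phantom voters have inflated the roster."""
    return num_other_sentinels > expected

def reset_all_sentinels(hosts=SENTINEL_HOSTS):
    import redis  # third-party redis-py client, assumed available
    for host in hosts:
        client = redis.Redis(host=host, port=SENTINEL_PORT)
        client.execute_command("SENTINEL", "RESET", "*")
        # Per the Sentinel docs, reset instances one after the other,
        # waiting at least 30 seconds so they can rediscover each other.
        time.sleep(30)

# Decision logic is separable from the network call:
print(reset_needed(4))  # -> True  (phantom voters present)
print(reset_needed(2))  # -> False (roster matches reality)
```

Note that SENTINEL RESET also clears the sentinel's current state for the matched masters, which is why the docs advise resetting instances sequentially rather than all at once.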