Redis failover inhibited by residual phantom sentinel voters
Problem: Sentinels disagree about how many votes constitute a majority
Normally we run 3 sentinels for each redis cluster, and we expect any 2 of them to be able to trigger a redis failover, following the rule that a majority vote avoids the risk of long-lived split-brain.
However, unexpectedly, losing 1 sentinel (or the host it runs on) would prevent some of the surviving sentinels from being able to reach the majority required to authorize a redis failover.
This fragile state is due to the sentinels having different expectations about the number of voters and, consequently, about the threshold for reaching a majority.
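Under normal conditions, we expect 3 sentinels to be able to vote. But currently our 3 sentinels (redis-01, redis-02, redis-03) respectively think that there are 5, 4, and 3 voters, since each sentinel counts itself plus the peers it reports as num-other-sentinels: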
msmiley@saoirse:~$ mussh -m -b -h redis-{01..03}-db-gprd.c.gitlab-production.internal -c '/opt/gitlab/embedded/bin/redis-cli -p 26379 sentinel master gprd-redis | grep -A1 "num-other-sentinels"'
redis-01-db-gprd.c.gitlab-production.internal: num-other-sentinels
redis-01-db-gprd.c.gitlab-production.internal: 4
redis-02-db-gprd.c.gitlab-production.internal: num-other-sentinels
redis-02-db-gprd.c.gitlab-production.internal: 3
redis-03-db-gprd.c.gitlab-production.internal: num-other-sentinels
redis-03-db-gprd.c.gitlab-production.internal: 2
Consequently, 2 of our sentinels require a minimum of 3 votes to reach a majority, even though only 3 real voters exist. That inaccuracy means that failing a random 1 of the 3 redis hosts has a 67% chance of preventing the sentinels from performing a failover if one were needed. The split-brain protection afforded by requiring a majority to authorize failover is now overly cautious and would refuse to fail over even if 2 of the 3 real voters were in agreement.
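To make the arithmetic explicit, here is a minimal sketch of how each sentinel's majority threshold follows from its num-other-sentinels value. It assumes a strict majority of the voters that sentinel knows about (its peers plus itself). The function name majority_threshold is hypothetical, and the host names are shortened from the output above.

```python
def majority_threshold(num_other_sentinels: int) -> int:
    """Votes needed for a strict majority, from one sentinel's point of view."""
    voters = num_other_sentinels + 1  # known peers plus the sentinel itself
    return voters // 2 + 1


# num-other-sentinels values taken from the command output above
reported = {"redis-01": 4, "redis-02": 3, "redis-03": 2}
thresholds = {host: majority_threshold(n) for host, n in reported.items()}

for host, needed in thresholds.items():
    print(f"{host}: believes {reported[host] + 1} voters, needs {needed} votes")
```

With only 3 live sentinels, any sentinel whose threshold is 3 needs every real voter to be reachable.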
Corrective actions
We need to add detection and prevention for this condition.
Detection: Metric and alerting
For detection, we could export a Prometheus metric from a check that connects to Redis Sentinel (port 26379) and runs:
SENTINEL MASTER <master_group_name>
or
SENTINEL MASTERS
Extract its output field num-other-sentinels. Alert if that field is anything other than 2 for more than a few minutes:
- Being greater than 2 may mean dead phantom sentinels are still considered voters. (This is the current state, which we want to fix and prevent.)
- Being less than 2 may mean that we have unexpectedly lost a sentinel and then explicitly told the remaining sentinels to forget about inactive peer sentinels.
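As a rough sketch of the check described above, the snippet below extracts num-other-sentinels from the raw reply and classifies it. It assumes the non-interactive redis-cli output format seen earlier, where each field name is followed by its value on the next line. The function names and the sample reply are hypothetical. The "for more than a few minutes" condition is left to the alerting rule, for example a `for:` clause in Prometheus.

```python
def num_other_sentinels(raw_reply: str) -> int:
    """Return the value on the line that follows the field name."""
    lines = [line.strip() for line in raw_reply.splitlines()]
    idx = lines.index("num-other-sentinels")  # ValueError if the field is absent
    return int(lines[idx + 1])


def classify(count: int, expected: int = 2) -> str:
    """Map the peer count to one of the three conditions listed above."""
    if count > expected:
        return "phantom_sentinels"
    if count < expected:
        return "missing_sentinels"
    return "ok"


# Trimmed, illustrative reply; a real SENTINEL MASTER reply has many more fields.
sample = "name\ngprd-redis\nnum-other-sentinels\n4\n"
print(classify(num_other_sentinels(sample)))
```

Exporting the raw count as a gauge and applying the comparison in the alert rule would be an equally reasonable split; the classification is shown here only to make the two failure directions concrete.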
Prevention
I suspect this condition is implicitly induced by our maintenance procedure and that we could avoid it by adding a step to remove the phantom sentinels from the voting roster.
To tell a sentinel to forget about all its historical peers and rediscover only the currently live peers, we can run:
SENTINEL RESET *
See this section of the Redis Sentinel docs: https://redis.io/topics/sentinel#adding-or-removing-sentinels
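The linked docs section recommends sending the reset to the sentinels one after the other, waiting at least 30 seconds between instances, so that they are not all rediscovering peers at the same moment. A minimal sketch of that loop follows. It assumes redis-cli is on the PATH and that each sentinel port is reachable over the network (our hosts may instead need the command run locally over ssh, as in the example above). The function reset_sentinels and its run and sleep parameters are hypothetical, added so the loop can be exercised without touching a real sentinel.

```python
import subprocess
import time


def reset_sentinels(hosts, port=26379, pause_seconds=30,
                    run=subprocess.run, sleep=time.sleep):
    """Send SENTINEL RESET * to each sentinel in turn, pausing between hosts."""
    for position, host in enumerate(hosts):
        # Passed as an argument list, so "*" is not expanded by a shell.
        command = ["redis-cli", "-h", host, "-p", str(port),
                   "SENTINEL", "RESET", "*"]
        run(command, check=True)  # raises if redis-cli exits non-zero
        if position < len(hosts) - 1:
            sleep(pause_seconds)  # let this sentinel rediscover its live peers
```

After the loop completes, re-running the SENTINEL MASTER check on every sentinel should show num-other-sentinels equal to 2 on each of them.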