Understand bad failover in bitnami/redis chart
Spun out of #1356 (comment 714561398).
During tests of the bitnami/redis Helm chart, we ran into a botched failover in which a replica attempts to connect to itself as its primary. This condition takes quite a while to resolve on its own.
We should better understand why this happens, what the availability and durability impact is, and whether we can prevent it.
From the original issue:
One troubling behaviour I discovered while trying to manually trigger a failover is that the former primary sometimes gets into a bad state where it tries to connect to itself, producing this type of error message in a loop:
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:15:55.346 * Connecting to MASTER 172.17.0.3:6379
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:15:55.346 * MASTER <-> REPLICA sync started
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:15:55.346 * Non blocking connect for SYNC fired the event.
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:15:55.347 * Master replied to PING, replication can continue...
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:15:55.348 * Trying a partial resynchronization (request 2d17bcb1f252940866922b5c9c887c2184c07f0e:498634).
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:15:55.349 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
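To spot this loop without reading logs by hand, a small filter along the following lines might help. This is only a sketch: it assumes the log format shown above, and that the pod's own IP is known separately (172.17.0.3 for node-0 in this incident). The function name `self_replication_lines` and the `SAMPLE` list are illustrative, not part of any chart or Redis tooling.

```python
import re

# Matches the replica-side log line "* Connecting to MASTER <host>:<port>".
CONNECT_RE = re.compile(r"\* Connecting to MASTER (?P<host>\S+):(?P<port>\d+)\s*$")


def self_replication_lines(log_lines, own_addresses):
    """Return the log lines in which a node tries to sync from one of its own addresses.

    own_addresses is an assumption of this sketch: the caller must supply the
    pod IP and/or hostname that identify the node producing the logs.
    """
    own = set(own_addresses)
    hits = []
    for line in log_lines:
        match = CONNECT_RE.search(line)
        if match and match.group("host") in own:
            hits.append(line)
    return hits


# Two lines copied from the excerpts in this issue.
SAMPLE = [
    "my-release-redis-node-0 redis 1:S 26 Oct 2021 14:15:55.346 * Connecting to MASTER 172.17.0.3:6379",
    "my-release-redis-node-0 redis 1:S 26 Oct 2021 14:16:54.023 * Connecting to MASTER "
    "my-release-redis-node-2.my-release-redis-headless.default.svc.cluster.local:6379",
]

if __name__ == "__main__":
    for hit in self_replication_lines(SAMPLE, ["172.17.0.3"]):
        print("self-replication attempt:", hit)
```

Matching on the IP alone is a deliberate simplification. If the node were addressed by hostname instead, the hostname would have to be passed in `own_addresses` as well.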
In this case, Sentinel fixed the situation after about 20 seconds, informing the node of who the new primary is:
my-release-redis-node-0 sentinel 1:X 26 Oct 2021 14:16:54.023 * +fix-slave-config slave 172.17.0.3:6379 172.17.0.3 6379 @ mymaster my-release-redis-node-2.my-release-redis-headless.default.svc.cluster.local 6379
And then it successfully connects:
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:16:54.023 * Connecting to MASTER my-release-redis-node-2.my-release-redis-headless.default.svc.cluster.local:6379
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:16:54.025 * MASTER <-> REPLICA sync started
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:16:54.025 * REPLICAOF my-release-redis-node-2.my-release-redis-headless.default.svc.cluster.local:6379 enabled (user request from 'id=320 addr=172.17.0.3:52274 laddr=172.17.0.3:6379 fd=14 name=sentinel-fc43e080-cmd age=61 idle=0 flags=x db=0 sub=0 psub=0 multi=4 qbuf=263 qbuf-free=40691 argv-mem=4 obl=45 oll=0 omem=0 tot-mem=61468 events=r cmd=exec user=default redir=-1')
my-release-redis-node-0 redis 1:S 26 Oct 2021 14:16:54.027 * Non blocking connect for SYNC fired the event.
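To put a number on how long the bad state lasts, the timestamps in these log lines can be subtracted directly. The sketch below assumes the timestamp format seen in the excerpts (day, English month abbreviation, year, time with milliseconds). The helper names `log_time` and `seconds_between` are made up for illustration.

```python
import re
from datetime import datetime

# Timestamp as it appears in the Redis/Sentinel log lines, e.g. "26 Oct 2021 14:15:55.346".
TS_RE = re.compile(r"\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}")


def log_time(line):
    """Extract the timestamp from one log line (assumes English month abbreviations)."""
    match = TS_RE.search(line)
    if match is None:
        raise ValueError("no timestamp found in line: %r" % line)
    return datetime.strptime(match.group(0), "%d %b %Y %H:%M:%S.%f")


def seconds_between(first_line, second_line):
    """Seconds elapsed from first_line to second_line."""
    return (log_time(second_line) - log_time(first_line)).total_seconds()


# First self-connect line and the Sentinel fix line, copied from this issue
# (the fix line is truncated after the event name for brevity).
SELF_CONNECT = "my-release-redis-node-0 redis 1:S 26 Oct 2021 14:15:55.346 * Connecting to MASTER 172.17.0.3:6379"
FIX = "my-release-redis-node-0 sentinel 1:X 26 Oct 2021 14:16:54.023 * +fix-slave-config slave 172.17.0.3:6379"

if __name__ == "__main__":
    print("seconds until +fix-slave-config:", seconds_between(SELF_CONNECT, FIX))
```

For the two lines used here the gap is just under 59 seconds. That only measures the distance between these particular excerpts, so it may be worth checking against the full logs when estimating how long the loop really lasted.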
In this particular case we might survive it, since clients will connect to the new primary. But I'd like to understand this behaviour a bit better, and see if there is a fix, or at least an upstream issue we could file about it.