Incorrect Redis HA configuration leads to non-starting cluster
When Redis is running in HA mode, using Sentinel, the redis.conf file should not specify the slaveof configuration.
Currently it is specified. On a full restart of the cluster, this leaves the cluster in a misconfigured state that can only be resolved by manual intervention.
On the "supposed" Redis master, we see the following in the logs:
2018-07-17_10:04:06.50021 5077:S 17 Jul 10:04:06.500 * Connecting to MASTER 10.224.7.101:6379
2018-07-17_10:04:06.50025 5077:S 17 Jul 10:04:06.500 * MASTER <-> SLAVE sync started
2018-07-17_10:04:06.50031 5077:S 17 Jul 10:04:06.500 * Non blocking connect for SYNC fired the event.
2018-07-17_10:04:06.50039 5077:S 17 Jul 10:04:06.500 * Master replied to PING, replication can continue...
2018-07-17_10:04:06.50044 5077:S 17 Jul 10:04:06.500 * Partial resynchronization not possible (no cached master)
2018-07-17_10:04:06.50049 5077:S 17 Jul 10:04:06.500 * Master does not support PSYNC or is in error state (reply: -ERR Can't SYNC while not connected with my master)
2018-07-17_10:04:06.50050 5077:S 17 Jul 10:04:06.500 * Retrying with SYNC...
2018-07-17_10:04:06.50060 5077:S 17 Jul 10:04:06.500 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
There are several problems here:
- redis.conf specifies the following configuration: slaveof 10.224.7.101 6379. 10.224.7.101 is the current host. The host will not start since it has been configured to replicate off itself.
- The replication configuration is managed on behalf of Redis by Sentinel. Configuring a static master in the redis.conf will quickly get out-of-date after failover. It is better to leave the slaveof configuration out and allow Sentinel to configure the Redis instances.
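The first problem can be detected mechanically before a restart. The following is a minimal sketch, not part of any existing tooling: it scans the text of a redis.conf for an uncommented slaveof directive that points at the node's own address. The function name `find_self_replication` and the `own_endpoints` parameter are hypothetical, and the sample configuration is made up from the values quoted above.

```python
import re

# Matches an uncommented "slaveof <host> <port>" directive (case-insensitive).
SLAVEOF_RE = re.compile(r"^\s*slaveof\s+(\S+)\s+(\d+)\s*$", re.IGNORECASE)


def find_self_replication(conf_text, own_endpoints):
    """Return the (host, port) targets of slaveof directives that point
    at one of this node's own endpoints.

    own_endpoints is a set of (host, port) tuples (hypothetical input;
    how a node learns its own address is outside this sketch).
    """
    hits = []
    for line in conf_text.splitlines():
        match = SLAVEOF_RE.match(line)
        if not match:
            continue
        target = (match.group(1), int(match.group(2)))
        if target in own_endpoints:
            hits.append(target)
    return hits


# Illustrative input using the address from the report above.
sample_conf = "port 6379\nslaveof 10.224.7.101 6379\n"
print(find_self_replication(sample_conf, {("10.224.7.101", 6379)}))
```

A non-empty result means the node is configured to replicate from itself. The check compares addresses literally, so a hostname that resolves to the node's own IP would not be caught.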
The problem was fixed by manually removing the slaveof configuration and restarting the Redis instances in the cluster. Once this was done, Sentinel quickly recovered the cluster, as seen in the logs:
2018-07-17_10:04:08.25877 14856:M 17 Jul 10:04:08.258 * DB loaded from disk: 0.454 seconds
2018-07-17_10:04:08.25880 14856:M 17 Jul 10:04:08.258 * The server is now ready to accept connections on port 6379
2018-07-17_10:04:08.35224 14856:M 17 Jul 10:04:08.352 * Slave 10.224.7.102:6379 asks for synchronization
2018-07-17_10:04:08.35227 14856:M 17 Jul 10:04:08.352 * Full resync requested by slave 10.224.7.102:6379
2018-07-17_10:04:08.35227 14856:M 17 Jul 10:04:08.352 * Starting BGSAVE for SYNC with target: disk
2018-07-17_10:04:08.35485 14856:M 17 Jul 10:04:08.354 * Background saving started by pid 14859
2018-07-17_10:04:08.36843 14856:M 17 Jul 10:04:08.368 * Slave 10.224.7.103:6379 asks for synchronization
2018-07-17_10:04:08.36845 14856:M 17 Jul 10:04:08.368 * Full resync requested by slave 10.224.7.103:6379
2018-07-17_10:04:08.36845 14856:M 17 Jul 10:04:08.368 * Waiting for end of BGSAVE for SYNC
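The manual fix described above (removing the slaveof configuration) could be scripted along the following lines. This is only a sketch under stated assumptions: `strip_slaveof` is a hypothetical helper that works on the configuration text, and reading the file, writing it back, and restarting the Redis instances are left out.

```python
import re

# Matches any uncommented line that starts with the slaveof directive.
SLAVEOF_LINE = re.compile(r"^\s*slaveof\s", re.IGNORECASE)


def strip_slaveof(conf_text):
    """Return conf_text with every uncommented slaveof line removed.

    All other lines, including comments, are kept unchanged.
    """
    kept = [
        line
        for line in conf_text.splitlines(keepends=True)
        if not SLAVEOF_LINE.match(line)
    ]
    return "".join(kept)


# Illustrative input using the address from the report above.
before = "port 6379\nslaveof 10.224.7.101 6379\nappendonly no\n"
print(strip_slaveof(before))
```

With the directive removed, each instance starts without a statically configured master, which leaves the replication topology for Sentinel to set.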
Edited by Andrew Newdigate