Accidental reboot of redis-cache-03, followed by failover of redis-cache-01

Summary

Timeline

All times UTC.

2020-03-31

13:13:56 - Accidental shutdown of redis-cache-03 (replica)
13:14:06 - Sentinels detect +sdown
13:19:58 - Redis process on redis-cache-03 starts coming back up
13:19:59 - Sentinels detect -sdown and +reboot
13:23:03 - redis-cache-03 starts full resync from master
13:23:13 - Sentinels perform failover from redis-cache-01 to redis-cache-02 based on a short-lived +odown of redis-cache-01
13:23:15 - redis-cache-01 starts full resync from master (which is now redis-cache-02)
13:23:15 - redis-cache-03 starts full resync from master (which is now redis-cache-02)
13:34:17 - redis-cache-01 finishes RDB dump (possibly still running from when it was master?)
13:42:18 - redis-cache-01 starts loading data received from master (redis-cache-02)
13:42:20 - redis-cache-03 starts loading data received from master (redis-cache-02)
13:48:34 - redis-cache-03 finished recovering
13:49:00 - PagerDuty alert fires
13:53:26 - redis-cache-01 finished recovering
14:32:00 - Incident declared from Slack
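
The gaps between the timeline entries above can be worked out with a short script. This is a minimal sketch for illustration only, not part of any incident tooling: the event labels and the helper names `elapsed_seconds` and `format_duration` are hypothetical, and the timestamps are copied from the timeline (all on 2020-03-31, UTC).

```python
from datetime import datetime

# Timestamps copied from the timeline (UTC, same day).
# The dictionary keys are illustrative labels, not names used by any tool.
EVENTS = {
    "cache03_shutdown": "13:13:56",
    "sdown_detected": "13:14:06",
    "cache03_process_up": "13:19:58",
    "failover": "13:23:13",
    "cache03_recovered": "13:48:34",
    "pagerduty_alert": "13:49:00",
    "cache01_recovered": "13:53:26",
    "incident_declared": "14:32:00",
}


def elapsed_seconds(start: str, end: str) -> int:
    """Return the seconds between two HH:MM:SS times on the same day."""
    time_format = "%H:%M:%S"
    delta = datetime.strptime(end, time_format) - datetime.strptime(start, time_format)
    return int(delta.total_seconds())


def format_duration(seconds: int) -> str:
    """Render a non-negative number of seconds as e.g. '6m02s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m{secs:02d}s"


if __name__ == "__main__":
    # Pairs of (start, end) labels whose gap is worth reading off the timeline.
    intervals = [
        ("cache03_shutdown", "sdown_detected"),
        ("cache03_shutdown", "cache03_process_up"),
        ("cache03_shutdown", "cache03_recovered"),
        ("failover", "cache01_recovered"),
        ("cache03_shutdown", "pagerduty_alert"),
        ("pagerduty_alert", "incident_declared"),
    ]
    for start, end in intervals:
        gap = elapsed_seconds(EVENTS[start], EVENTS[end])
        print(f"{start} -> {end}: {format_duration(gap)}")
```

The helper assumes both timestamps fall on the same day, which holds for every entry in this timeline.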

Details

As part of the work on #1866 (closed), which was scoped to staging only, I accidentally shut down a production Redis replica. This also appears to have triggered a failover of the primary (redis-cache-01 failed over to redis-cache-02).

Source

Incident declared by iwiedler in Slack via the /incident declare command.

Resources

If the Situation Zoom room was used, the recording will be automatically uploaded to the Incident room Google Drive folder (private).

Edited Mar 30, 2020 by Matt Smiley