Accidental reboot of redis-cache-03, followed by failover of redis-cache-01
Summary
Accidental reboot of redis-cache-03, followed by failover of redis-cache-01
Timeline
All times UTC.
2020-03-31
- 13:13:56 - Accidental shutdown of redis-cache-03 (replica)
- 13:14:06 - Sentinels detect +sdown
- 13:19:58 - Redis process on redis-cache-03 starts coming back up
- 13:19:59 - Sentinels detect -sdown and +reboot
- 13:23:03 - redis-cache-03 starts full resync from master
- 13:23:13 - Sentinels perform failover from redis-cache-01 to redis-cache-02 based on short-lived +odown of redis-cache-01
- 13:23:15 - redis-cache-01 starts full resync from master (which is now redis-cache-02)
- 13:23:15 - redis-cache-03 starts full resync from master (which is now redis-cache-02)
- 13:34:17 - redis-cache-01 finishes RDB dump (possibly still running from when it was master?)
- 13:42:18 - redis-cache-01 starts loading data received from master (redis-cache-02)
- 13:42:20 - redis-cache-03 starts loading data received from master (redis-cache-02)
- 13:48:34 - redis-cache-03 finished recovering
- 13:49:00 - PagerDuty alert fires
- 13:53:26 - redis-cache-01 finished recovering
- 14:32:00 - Incident declared from Slack
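The gaps between key events in the timeline above can be worked out with a short script. This is a minimal sketch for illustration only: the timestamps are copied from the timeline, while the dictionary labels and the helper names `at` and `elapsed` are made up here and are not part of any incident tooling.

```python
from datetime import datetime, timedelta

DAY = "2020-03-31"  # all timeline entries are on this date, in UTC

# Timestamps copied from the timeline. The keys are illustrative labels
# chosen for this sketch, not identifiers from the incident tooling.
EVENTS = {
    "replica_shutdown": "13:13:56",
    "failover": "13:23:13",
    "cache03_recovered": "13:48:34",
    "pagerduty_alert": "13:49:00",
    "cache01_recovered": "13:53:26",
    "incident_declared": "14:32:00",
}


def at(label: str) -> datetime:
    """Return the timestamp of a labelled event as a datetime."""
    return datetime.strptime(f"{DAY} {EVENTS[label]}", "%Y-%m-%d %H:%M:%S")


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two labelled events."""
    return at(end) - at(start)


if __name__ == "__main__":
    pairs = [
        ("replica_shutdown", "cache03_recovered"),
        ("failover", "cache01_recovered"),
        ("replica_shutdown", "pagerduty_alert"),
        ("replica_shutdown", "incident_declared"),
    ]
    for start, end in pairs:
        print(f"{start} -> {end}: {elapsed(start, end)}")
```

By this calculation, redis-cache-03 took about 35 minutes to recover after the shutdown, and redis-cache-01 took about 30 minutes to recover after the failover.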
Details
As part of the work on #1866 (closed), which was scoped to staging only, I accidentally shut down a production Redis replica (redis-cache-03). This also appears to have triggered a failover of the primary (redis-cache-01 failed over to redis-cache-02).
Source
Incident declared by iwiedler in Slack via the /incident declare command.
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Edited by Matt Smiley