2020-06-18: Redis down on redis-sidekiq-03
Summary
After high load, redis-sidekiq-03 rebooted, leaving one Redis Sidekiq replica down.
Timeline
All times UTC.
2020-06-18
- 14:06 - elevated mailer queue on catchall sidekiq fleet (maybe related?)
- 14:06 - elevated keys rate of change on redis-sidekiq
- 14:13 - log entry on the primary shows the failed replica disconnecting
- 14:13 - Sentinel down alert: https://gitlab.pagerduty.com/incidents/P2083FE
- 14:19 - redis-sidekiq-03 rebooting
- 14:33 - Redis metrics missing alert: https://gitlab.pagerduty.com/incidents/P4PR9GZ
- 14:40 - hphilipps declares incident in Slack using the `/incident declare` command
- 14:48 - redis and sentinel services manually started by EOC; replica back after partial sync (see the verification sketch below)
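For reference, a minimal sketch of the kind of post-restart verification involved here, assuming redis-py and direct network access to the node; the hostname is taken from the incident, the ports are Redis defaults, and this is illustrative, not a transcript of the actual commands run:

```python
# Minimal post-restart check: confirm the replica resynced and sentinel
# sees the primary as healthy. Hostname/ports are illustrative defaults.
import redis

# Replication state as seen by the restarted replica.
replica = redis.Redis(host="redis-sidekiq-03", port=6379)
info = replica.info("replication")
# A healthy replica reports role=slave and master_link_status=up.
print(info["role"], info.get("master_link_status"))

# Ask the local sentinel (default port 26379) for its view of the primaries.
sentinel = redis.Redis(host="redis-sidekiq-03", port=26379)
print(sentinel.execute_command("SENTINEL", "MASTERS"))
```

A healthy result is `role=slave` with `master_link_status=up` on the replica, and the sentinel listing the primary without down flags.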
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (e.g. external customers, internal customers)
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines