2020-06-18: Redis down on redis-sidekiq-03
Summary
After high load, redis-sidekiq-03 rebooted, leaving one Redis Sidekiq replica down.
Timeline
All times UTC.
2020-06-18
- 14:06 - elevated mailer queue on catchall sidekiq fleet (maybe related?)
- 14:06 - elevated keys rate of change on redis-sidekiq
- 14:13 - log entry on the primary shows the failed replica disconnecting
- 14:13 - Sentinel down alert: https://gitlab.pagerduty.com/incidents/P2083FE
- 14:19 - redis-sidekiq-03 rebooting
- 14:33 - Redis metrics missing alert: https://gitlab.pagerduty.com/incidents/P4PR9GZ
- 14:40 - hphilipps declares incident in Slack using the `/incident declare` command
- 14:48 - redis and sentinel services manually started by EOC; replica back after partial sync (see the verification sketch below)
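For reference, a minimal sketch of the kind of post-restart verification involved here, assuming redis-py and direct network access to the node; the hostname is taken from the incident, the ports are Redis defaults, and this is illustrative, not a transcript of the actual commands run:

```python
# Minimal post-restart check: confirm the replica resynced and sentinel
# sees the primary as healthy. Hostname/ports are illustrative defaults.
import redis

# Replication state as seen by the restarted replica.
replica = redis.Redis(host="redis-sidekiq-03", port=6379)
info = replica.info("replication")
# A healthy replica reports role=slave and master_link_status=up.
print(info["role"], info.get("master_link_status"))

# Ask the local sentinel (default port 26379) for its view of the primaries.
sentinel = redis.Redis(host="redis-sidekiq-03", port=26379)
print(sentinel.execute_command("SENTINEL", "MASTERS"))
```

A healthy result is `role=slave` with `master_link_status=up` on the replica, and the sentinel listing the primary without down flags.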
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (e.g. external customers, internal customers)
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines