2020-03-24 Sidekiq not meeting latency SLOs
Summary
A burst of emails, generated when one customer moved multiple projects, created nearly 100K project_was_moved_email
jobs. Attempting to deliver them saturated the outbound NAT pool, causing timeouts, which in turn triggered the latency apdex alert.
Timeline
All times UTC.
2020-03-24
- 01:43 - Project moves began
- 01:52 - Number of emails began ramping up
- 01:56 - NAT gateway errors began occurring at substantially more than the usual baseline rate
- 02:02 - PagerDuty alert received for the SLO apdex; investigation began
- 02:12 - PagerDuty alert resolved without active intervention as the queue dropped
- 02:15 - Emails stopped being generated (project move activity had ceased)
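As a rough illustration of the load behind the NAT saturation, the timeline can be turned into an average job rate. This is only a back-of-the-envelope sketch: it assumes the "nearly 100K" jobs are rounded up to 100,000 and were spread evenly between 01:52 and 02:15, neither of which is stated in this report. The real arrival pattern was a ramp, so the peak rate was probably higher than this average.

```python
from datetime import datetime

# Timestamps taken from the timeline above (UTC).
ramp_start = datetime(2020, 3, 24, 1, 52)  # number of emails began ramping up
ramp_end = datetime(2020, 3, 24, 2, 15)    # emails stopped being generated

# Assumption: "nearly 100K" rounded up to 100,000 jobs, spread evenly.
job_count = 100_000

window_seconds = (ramp_end - ramp_start).total_seconds()
jobs_per_second = job_count / window_seconds

print(f"window: {window_seconds / 60:.0f} min")
print(f"average rate: {jobs_per_second:.1f} jobs/s")
```

Under these assumptions the average works out to roughly 72 email jobs per second over a 23-minute window, each of which may open an outbound connection through the NAT pool.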
Graphs/data will be added as comments
Service::Sidekiq ~S4
Resources
- If the Situation Zoom room was used, the recording will be automatically uploaded to the Incident room Google Drive folder (private)