Incident Review: Email Delays
## Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below, or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @-mention the EOC, IMOC, and the other parties involved whom we would like to include in a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved, and put a link to this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue and any questions that arise from it.
  - Ensure that the meeting is recorded; when complete, upload the recording to GitLab Unfiltered.
## Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Delayed 2FA emails prevented customers using email-based 2FA from accessing the platform.
  - Email deliveries were delayed, e.g. notifications on issues and topics.
- How many customers were affected?
  - Unknown
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Unknown, as we weren't monitoring this metric at the time.
- What were the root causes?
  - Our mail service provider was unable to deliver emails to certain downstream mail providers (the messages were rejected downstream), which triggered exponential backoff delays on retries (see the sketch below).
  - The reason for the rejections is unknown, although it is theorized that they were caused by the vast quantity of email we were sending.
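To make the backoff mechanics concrete, below is a minimal Python sketch of the exponential backoff schedule a mail relay typically applies to rejected messages. The base delay, factor, and attempt cap are illustrative assumptions, not our provider's actual configuration.

```python
def backoff_schedule(base_seconds: int = 60, factor: int = 2, max_attempts: int = 8) -> None:
    """Print an illustrative retry schedule for a downstream-rejected email.

    Each rejection multiplies the wait before the next delivery attempt,
    so a message that keeps being rejected accumulates hours of delay.
    All parameters are hypothetical, not our provider's real settings.
    """
    delay, elapsed = base_seconds, 0
    for attempt in range(1, max_attempts + 1):
        elapsed += delay
        print(f"attempt {attempt}: retry after {delay}s (total delay {elapsed}s)")
        delay *= factor


backoff_schedule()  # by attempt 8 the cumulative delay already exceeds four hours
```

This compounding is why even a short-lived downstream rejection window translated into multi-hour delivery delays for affected messages.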
## Incident Response Analysis
- How was the incident detected?
  - Customer reports surfacing via Slack
- How could detection time be improved?
  - Onboarding our email delivery service into our metrics platform would give us the ability to detect these errors automatically; a sketch follows below.
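As a rough sketch of what that onboarding might look like, the snippet below uses the `prometheus_client` Python library to expose a per-outcome delivery counter for the monitoring platform to scrape. The metric name, label values, and port are illustrative assumptions, not an existing integration.

```python
from prometheus_client import Counter, start_http_server

# Hypothetical metric name and labels; the real ones would be agreed
# upon when onboarding the email service into the monitoring platform.
EMAIL_DELIVERY_EVENTS = Counter(
    "email_delivery_events_total",
    "Email delivery attempts, labelled by outcome",
    ["outcome"],
)

def record_delivery(outcome: str) -> None:
    """Record one delivery attempt, e.g. 'delivered', 'deferred', 'rejected'."""
    EMAIL_DELIVERY_EVENTS.labels(outcome=outcome).inc()

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping (port is an assumption)
    record_delivery("deferred")
```

An alert on a sustained rise in the deferred or rejected rate would have surfaced this incident before customer reports reached Slack.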
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
  - Onboarding those metrics to our internal monitoring platform would have given us the visibility to diagnose these issues more quickly and accurately.
- How did we reach the point where we knew how to mitigate the impact?
  - Correspondence with our mail provider via a ticketing system
- How could time to mitigation be improved?
  - A runbook entry to accompany automated alerts would help, although in this specific situation much of the time to mitigation was spent waiting on processes we couldn't control.
## Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - None that I could find.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
- What went well?
  - ...