Incident Review: Email Delays
## Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below, or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @-mention the EOC, IMOC, and the other parties involved whom we would like to include in a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved, and put a link to this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue and any questions that arise from it.
  - Ensure that the meeting is recorded; when complete, upload the recording to GitLab Unfiltered.
## Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Delayed 2FA emails prevented customers using email-based 2FA from accessing the platform.
  - Email deliveries were delayed, e.g. notifications on issues and topics.
- How many customers were affected?
  - Unknown
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Unknown, as we weren't monitoring this metric at the time.
- What were the root causes?
  - Our mail service provider was unable to deliver emails to certain downstream mail providers (the messages were rejected downstream), which triggered exponential backoff delays on retries (see the sketch below).
  - The reason for the rejections is unknown, although it is theorized that they were caused by the vast quantity of email we were sending.
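To make the backoff mechanics concrete, below is a minimal Python sketch of the exponential backoff schedule a mail relay typically applies to rejected messages. The base delay, factor, and attempt cap are illustrative assumptions, not our provider's actual configuration.

```python
def backoff_schedule(base_seconds: int = 60, factor: int = 2, max_attempts: int = 8) -> None:
    """Print an illustrative retry schedule for a downstream-rejected email.

    Each rejection multiplies the wait before the next delivery attempt,
    so a message that keeps being rejected accumulates hours of delay.
    All parameters are hypothetical, not our provider's real settings.
    """
    delay, elapsed = base_seconds, 0
    for attempt in range(1, max_attempts + 1):
        elapsed += delay
        print(f"attempt {attempt}: retry after {delay}s (total delay {elapsed}s)")
        delay *= factor


backoff_schedule()  # by attempt 8 the cumulative delay already exceeds four hours
```

This compounding is why even a short-lived downstream rejection window translated into multi-hour delivery delays for affected messages.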
## Incident Response Analysis
- How was the incident detected?
  - Customer reports surfacing via Slack
- How could detection time be improved?
  - Onboarding our email delivery service into our metrics platform would give us the ability to detect these errors automatically; a sketch follows below.
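As a rough sketch of what that onboarding might look like, the snippet below uses the `prometheus_client` Python library to expose a per-outcome delivery counter for the monitoring platform to scrape. The metric name, label values, and port are illustrative assumptions, not an existing integration.

```python
from prometheus_client import Counter, start_http_server

# Hypothetical metric name and labels; the real ones would be agreed
# upon when onboarding the email service into the monitoring platform.
EMAIL_DELIVERY_EVENTS = Counter(
    "email_delivery_events_total",
    "Email delivery attempts, labelled by outcome",
    ["outcome"],
)

def record_delivery(outcome: str) -> None:
    """Record one delivery attempt, e.g. 'delivered', 'deferred', 'rejected'."""
    EMAIL_DELIVERY_EVENTS.labels(outcome=outcome).inc()

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping (port is an assumption)
    record_delivery("deferred")
```

An alert on a sustained rise in the deferred or rejected rate would have surfaced this incident before customer reports reached Slack.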
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
  - Onboarding those metrics to our internal monitoring platform would have given us the visibility to diagnose these issues more quickly and accurately.
- How did we reach the point where we knew how to mitigate the impact?
  - Correspondence with our mail provider via a ticketing system
- How could time to mitigation be improved?
  - A runbook entry to accompany automated alerts would help, although in this specific situation much of the time to mitigation was spent waiting on processes we couldn't control.
## Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - None that I could find.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
- What went well?
  - ...