Incident Review: redis-cluster-chat-cache traffic cessation alert

Key Information

Metric                   Value
Customers Affected       No
Requests Affected        No
Incident Severity        ~Severity::4
Start Time               2025-01-23 16:00
End Time                 2025-01-23 17:30
Total Duration           90 mins
Link to Incident Issue   #19161 (closed)

Summary

The traffic volume to the Chat Redis instance became zero. The EOC suspected an active incident affecting end-users, so the merge request that caused the change was reverted.

Details

  1. The EOC was paged by an alert that the traffic volume to the Chat Redis instance had become zero, and declared an incident (an example of this kind of alert rule is sketched after this list).
  2. The incident was discussed in the Slack channel #incident-19161 and was escalated from S3 to S2 on a hunch.
  3. Incident managers reached out to the Duo Chat / AI Framework group for further insight. The group identified the merge request that caused the change and reverted it.
  4. The incident was mitigated.
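For context, a zero-traffic alert of this kind is typically expressed as a Prometheus alerting rule. The sketch below is a hypothetical illustration only: the metric name (from redis_exporter), the label selector, and the thresholds are assumptions, not the actual production rule.

    groups:
      - name: redis-cluster-chat-cache
        rules:
          - alert: RedisChatCacheTrafficCessation
            # Fires when the Chat Redis instance has processed no commands
            # for 10 minutes. Metric and label names are assumptions.
            expr: rate(redis_commands_processed_total{type="redis-cluster-chat-cache"}[5m]) == 0
            for: 10m
            labels:
              severity: s4
            annotations:
              summary: "Traffic volume to the Chat Redis instance has dropped to zero"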

Outcomes/Corrective Actions

  1. Update the metadata of the existing alerts in the runbook to point to the Duo Chat runbook for the triage process (see the sketch after this list).
  2. Prepare a Change Request for dropping the traffic from the Chat Redis instance.
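As a rough illustration of corrective action 1, the alert's annotations could carry a direct link to the Duo Chat runbook so responders land on the triage steps immediately. The runbook_url annotation follows a common Prometheus convention, and the URL below is a placeholder, not the actual runbook location.

    annotations:
      summary: "Traffic volume to the Chat Redis instance has dropped to zero"
      # Placeholder path; point this at the real Duo Chat runbook.
      runbook_url: "https://example.com/runbooks/duo-chat/triage.md"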

Learning Opportunities

What went well?

  1. We quickly identified the merge request that triggered the alert and reverted it to unblock auto-deploy as soon as possible.

What was difficult?

  1. The Duo Chat runbook was not reachable from the triggered alert, even though it was a crucial part of triaging the incident.
  2. The incident happened at a time when no one with expertise on the service was available.
  3. A Change Request was not created during the feature flag rollout, which could have helped identify the issue.

Review Guidelines

This review should be completed by the team that owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service-owning team, they should assign someone from that team as the DRI.

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and assign a DRI from the team that owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
  • Announce the incident review in the incident channel on Slack.
    :mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
    If you have any review feedback please add it to <ISSUE_LINK>.

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
  • Write a few short sentences in the Summary section summarizing what happened (TL;DR)
  • Use the description section to write a few paragraphs explaining what happened
  • Link any corrective actions and describe any other actions or outcomes from the incident
  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
  • Add any appropriate labels based on the incident issue and discussions
  • Once discussion wraps up in the comments, summarize any takeaways in the details section
  • Close the review before the due date