Incident Review: 2024-06-05 page styling broken in staging-canary causing all QA tests to fail
Incident Review
The DRI for the incident review is the issue assignee.
-
Announce the incident review in the incident channel on Slack.
:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.
-
If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included. -
If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident. -
Fill out relevant sections below or link to the meeting review notes that cover these topics - If there is a need to schedule a synchronous review, complete the following steps:
-
In this issue, @
mention the EOC, IMOC and other parties who were involved that we would like to schedule a sync review discussion of this issue. -
Schedule a meeting that works the best for those involved, in the agenda put a link to this review issue. The meeting should primarily discuss what is already documented in this issue, and any questions that arise from it. -
Ensure that the meeting is recorded, when complete upload the recording to GitLab unfiltered.
-
Customer Impact
-
Who was impacted by this incident? (i.e. external customers, internal customers)
- Internal only - deployment blocker for more than 1 day.
-
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- Potential impact only. Deploys were blocked for more than 1 day.
-
How many customers were affected?
- ...
-
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
- ...
What were the root causes?
- Core problems - assets deployed to staging canary were causing all smoke tests to fail.
- Assets generation tooling change that we needed to adapt to per #18107 (comment 1938804810)
Incident Response Analysis
-
How was the incident detected?
- Human declared incident per failures in staging-canary env (in test env good)
-
How could detection time be improved?
- ...
-
How was the root cause diagnosed?
- Significant research from Staff+ engineering in development.
-
How could time to diagnosis be improved?
- ...
-
How did we reach the point where we knew how to mitigate the impact?
- ...
-
How could time to mitigation be improved?
- ...
Post Incident Analysis
-
Did we have other events in the past with the same root cause?
- ...
-
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- ...
-
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
- ...
What went well?
- ...
Guidelines
Edited by Dave Smith