CMOC retrospective for incident #5194 - 2021-07-21
Summary
The CMOC was paged, acknowledged the page, and joined the incident Zoom room shortly afterwards. However, CMOC did not respond to verbal mentions in the Zoom room or to mentions in the Slack incident channel. This caused the IMOC and other involved parties to assume that CMOC was not present.
Unknown to everyone, CMOC was actually doing what he was supposed to do: actively listening to the conversation in the Zoom room, updating the status.io page in accordance with the recommended communication frequency for incidents and updating a GitLab.com customer emergency ticket which was related to the ongoing incident.
The cause was twofold: CMOC was unaware that he was muted in the Zoom incident room, and macOS on his laptop had accidentally been left in Do Not Disturb mode permanently, so Slack notifications were not delivered to his laptop.
Timeline
- 23:58 - Incident reported in #incident-management, #incident-5194 created, etc.
- 00:01 - CMOC mentioned in Slack by Cindy
- 00:13 - PagerDuty Emergencies paged by T-Mobile (Zendesk ticket #225516 - PagerDuty incident link)
- 00:18 - Cindy DMs CMOC with link to original thread.
- 00:19 - PagerDuty CMOC paged (PagerDuty incident link)
- 00:19 - CMOC responds to Cindy's mention in Slack in thread
- 00:20 - CMOC acknowledges PagerDuty page via phone.
- 00:24 - In the incident Zoom room:
  - CMOC reports hearing Steve say "CMOC is here so we need to roll out a first update or something", and CMOC remembers responding "Working on it". This turned out not to have been heard.
  - This is corroborated by Steve, who remembers CMOC joining the incident Zoom within 5 minutes of the incident being declared and not hearing a response from CMOC after he verbally called out to CMOC on the call.
- 00:28 - Steve Loyd messages in Slack: "also see the comms thread above ^ so far no response from @CMOC cc: @lyle"
- 00:28 - CMOC updates status.io with "Investigating" update
- 00:32 - Steve Loyd messages in Slack: "@lyle looks like status page is updated now, just not sure who did it. @CMOC appears in the zoom call, but we've not heard anything (maybe having audio issues?)"
- 00:33 - CMOC replies to emergency ticket (https://gitlab.zendesk.com/agent/tickets/225516)
- 00:33 - Lyle DMs CMOC
- 00:33 - Lyle DMs Wei Meng
- 00:37 - Kenneth joins the incident response at or before this point (Slack message)
- 00:44 - CMOC updates status.io with "identified" update.
  - Note: This was a routine status update, as CMOC set a timer to check every 10 minutes. (CMOC mentions that he spoke out on Zoom some time before or during this getting sent out).
- 00:45 - CMOC responds to Lyle's DMs
- 00:50 - Wei Meng DMs CMOC
- 00:56 - CMOC messages in Slack: "sorry for the confusion team, my slack had issues when i clicked on the acknowledge for the pagerduty" (edited)
- 01:01 - CMOC messages in Slack: "The rollout has completed on two of our clusters. This will continue on our last few clusters and we will post further updates once completed. More details at gitlab-com/gl-infra/production#5194 (closed)" (edited) @Steve Loyd is this ok to update?
- 01:04 - CMOC updates status.io with "identified" update
- 01:12 - CMOC updates status.io with "monitoring" update
- 01:39 - CMOC updates status.io with "resolved" update
All times UTC.
Status.io updates: https://app.status.io/dashboard/5b36dc6502d06804c08349f7/incident/60f76a48e62f97053594d3bd/edit
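To make the update cadence in the timeline easier to check, the gaps between the PagerDuty page and each status.io update can be computed directly from the timestamps listed above. The following is only a minimal sketch: the function and constant names are hypothetical, the times are copied from the timeline, and all of them fall after midnight UTC, so no date handling is needed.

```python
from datetime import datetime

# Times copied from the timeline above (all UTC, all after midnight).
PAGED = "00:19"  # PagerDuty CMOC paged
STATUS_UPDATES = [
    ("00:28", "Investigating"),
    ("00:44", "identified"),
    ("01:04", "identified"),
    ("01:12", "monitoring"),
    ("01:39", "resolved"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes from start to end, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def update_gaps(paged: str, updates: list[tuple[str, str]]) -> list[tuple[str, str, int]]:
    """For each update, return (time, status, minutes since the previous event)."""
    gaps = []
    previous = paged
    for time, status in updates:
        gaps.append((time, status, minutes_between(previous, time)))
        previous = time
    return gaps


if __name__ == "__main__":
    for time, status, gap in update_gaps(PAGED, STATUS_UPDATES):
        print(f"{time} {status}: {gap} min since previous event")
```

The first update went out 9 minutes after the page, which is consistent with the finding that CMOC was carrying out the role even though the other responders could not hear or reach him.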
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.
- ...
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - The IMOC and SRE on-call were not able to focus fully on incident resolution and had to scramble to figure out customer communications.
  - Four additional Support team members ended up being involved, even though it turned out that CMOC was performing the functions of the role just fine.
- What was the experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Everyone was confused initially as to what was happening, as CMOC appeared to have joined the call but did not respond to a verbal callout or subsequent Slack messages.
  - Updates to status.io were sent by CMOC, but the incident responders were confused by this as their assumption was that CMOC was not present.
What were the root causes?
CMOC joined the Zoom call but did not realise he was muted.
- CMOC responded verbally to Steve in Zoom, but this was not heard by anyone in the room. This gave the impression that CMOC was not engaged in CMOC activities.
  - Why wasn't CMOC heard? CMOC was muted by default in Zoom, and didn't realise he was still muted when he was speaking.
    - Why didn't CMOC realise he was muted? Other windows were overlapping the Zoom call window, and he was busy reviewing the issues, actively listening to the incident Zoom and updating status.io. CMOC usually also reviews past incidents in status.io that were similar to the current incident to get an idea of how best to phrase updates. CMOC mentioned that he did not hear any further mentions of his name on the Zoom call until he realised he was muted.
      - Why did CMOC finally realise he was muted? CMOC noticed that Kenneth had joined the call and thought that maybe he wanted to shadow; that's when CMOC thought that something was not quite right with the Zoom call.
CMOC had macOS notifications disabled due to a Do Not Disturb setting
- CMOC did not receive Slack notifications on desktop and he did not have his phone by his side. As a result, not all Slack mentions were noticed by CMOC; those that were noticed were seen only when the Slack window happened to be uncovered as CMOC moved windows around while responding to the incident.
  - Why weren't notifications received? Do Not Disturb was set to be on 24 hours a day in macOS System Preferences > Notifications, from 12am to 12am.
    - Why was DND configured this way? When CMOC was watching a movie 2 months ago, he turned on DND through System Preferences that way and did not turn it off.
      - Why didn't CMOC notice this earlier? CMOC usually has his phone by his side when working, so he continued to receive notifications through his phone and didn't think that something was wrong.
Lessons Learned
- CMOCs should verify that their presence is acknowledged and that they can be heard in the incident Zoom room before proceeding with any other substantial CMOC task.
- CMOCs should ensure that Slack notifications are working properly when they are on shift.
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)