Improve Slack-based incident communications

Current Situation

Presently, when an incident occurs, Slack threads start up in several places. This makes it difficult for an EOC/CMOC/IMOC to build context during incident mitigation.

When active incidents are handed off, it can be difficult to skim 3+ threads spread across several different channels.

From gitlab-com/www-gitlab-com#7945 (closed), which I've closed as a duplicate of this issue (even though that one came first!).

It was discussed that some participating members from dev escalation had trouble ensuring a clean handoff of communications between members. This problem is not specific to dev-escalation members; it also shows up during incidents for Infrastructure teams. Specifically:

  • There were multiple places to follow: many Slack threads, comments in issues, and Zoom calls
  • Some conversation threads were started in other channels outside of #incident-management and #dev-escalation
  • Some communication was held within an identified MR or associated issue rather than the incident issue

All of the above created a lot of work for the next member, who had to:

  • Read all threads that are known
  • Somehow learn what was discussed during a Zoom call
  • Discover what troubleshooting methods are known or have been attempted
  • Compare notes where information may be missing or obsolete as new information was discovered

The issues discussed above all added work on top of the ongoing situation, which already required the assistance of team members across various teams and functions.

Use this issue to discuss how we can update our communication standards and training to eliminate communication as a bottleneck when working together on a high-profile issue. Consider adopting similar policies for other teams where this is a common problem.

Desired Outcome

The incident notification in the #production and #incident-management Slack channels instructs GitLab team members on where to go for incident communications.

For all incidents, GitLab team members are pointed to a Slack channel which has been created for the incident.

The handbook is also updated with information on how to find the channel, and a link to the channel is available at the top of the incident description.

Acceptance Criteria

  • Necessary handbook updates are in place describing this workflow
  • Messaging is added to incident announcements to indicate procedure
  • Tooling is present to spin up an incident Slack channel
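
As a starting point for the tooling discussion, here is a minimal sketch of the naming and announcement pieces. Everything in it is an assumption for illustration: the `incident-<number>` naming scheme, the function names, and the message wording are hypothetical and not an agreed convention. Creating the channel itself would presumably go through Slack's `conversations.create` Web API method, which is left out so the sketch stays runnable without credentials.

```python
import re

# Slack limits channel names to 80 characters (lowercase letters,
# numbers, hyphens, and underscores).
MAX_CHANNEL_NAME_LENGTH = 80


def incident_channel_name(incident_number: int, prefix: str = "incident") -> str:
    """Build a Slack-safe channel name for an incident.

    The "<prefix>-<number>" scheme is a hypothetical convention.
    """
    if incident_number <= 0:
        raise ValueError("incident_number must be a positive integer")
    name = f"{prefix}-{incident_number}".lower()
    # Replace anything Slack would reject with a hyphen.
    name = re.sub(r"[^a-z0-9_-]", "-", name)
    return name[:MAX_CHANNEL_NAME_LENGTH]


def announcement_text(incident_number: int, issue_url: str) -> str:
    """Build the message posted to #production and #incident-management."""
    channel = incident_channel_name(incident_number)
    return (
        f"Incident {incident_number} has been declared. "
        f"Please keep all incident communication in #{channel}. "
        f"Incident issue: {issue_url}"
    )
```

For example, `incident_channel_name(1234)` yields `incident-1234`, and the same name can be placed at the top of the incident description so the channel is easy to find during a handoff.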
Edited by AnthonySandoval