Improve the Incident Management process for large-scale incidents
Problem to solve
When facing a large-scale incident, it might be necessary to ask for help and have additional EOC/IMOC/CMOC join and assist with incident resolution.
The current Incident Management process doesn't provide guidance or tooling to efficiently scale up the team involved and organize the operations.
Proposal
Following the feedback from the Incident Review for Site-wide Outage for GitLab... (production#15999 - closed), let's identify how we can improve the process and tooling.
- emphasize the option to ask for more help from the Infrastructure team. Bringing in people with a deeper understanding of the infrastructure could help speed up time to mitigation. An MR has already been started for this: gitlab-com/www-gitlab-com!126910 (merged)
- when the need arises to split operations across separate groups of people, we should be able to quickly spin up a new (recorded) Zoom room while maintaining cohesion and coordination with the main one.
- improve the Incident Management process page with additional guidance
- improve tooling (e.g. a woodhouse command to create a Zoom room)