Incident Management Simulation Day with the SRE and Monitor:Health teams
The Monitor:Health team owns Category:Alert Management (planned) and Category:Incident Management (viable). Internal customers for these categories include members of the SRE team. As we work to mature these categories into products that will be loved in market, one of the first steps we will need to take is building something that our internal customers can dogfood.
Alert Management and Incident Management involve highly critical workflows that must be highly reliable. In other words, before the SRE team can begin using GitLab to triage alerts and respond to incidents, we need to add more important functionality and demonstrate that GitLab can be used reliably.
As we continue to develop product in these categories aimed at enabling the SRE team to dogfood the product, we will be running Simulation Days (also called Game Days).
The purpose of this issue is to design and plan the first Simulation Day for the Monitor:Health team and the SRE team to use GitLab for alert and incident management.
Host one simulation day in FY21Q2.
- @sarahwaldner - Sr. PM for Monitor:Health
- @sgoldstein - Director of Engineering for Ops
- @brentnewton - Director of Infrastructure, Reliability
- @ClemMakesApps - Frontend Engineering Manager for Monitor:Health
- @crystalpoole - Backend Engineering Manager for Monitor:Health
Plan & To-dos
- Complete synchronous brainstorming session with stakeholders to kick off planning. Capture notes and link to this issue.
- Identify end-to-end workflow we will be testing in the first game day. Create visual in MURAL and link to this issue.
- Identify critical functionality required for the initial end-to-end workflow. Document list of features and link to this issue.