GitLab Simulation Day #3 Plan
Overview
The GitLab SRE team is an internal customer of Category:Incident Management. We are maturing this category by adding features and functionality that the SRE team can dogfood directly.
Incident Management involves critical workflows that must be highly reliable. In other words, before the SRE team can begin using GitLab to triage alerts and respond to incidents, we need to build more of the important functionality and demonstrate that GitLab can be used reliably.
As we continue to develop these categories so that the SRE team can dogfood the product, we will run Simulation Days (also called Game Days).
Purpose
The purpose of this issue is to design and plan the third Simulation Day, in which the Monitor:Health team and the SRE team use GitLab for alert and incident management.
Goal
Host one simulation day in FY21Q4.
Participants
- @brentnewton - Director of Infrastructure, Reliability
- @AnthonySandoval - Engineering Manager, Reliability
- @kbychu - GPM for Monitor
- @sgoldstein - Director of Engineering for Ops
- @crystalpoole - Engineering Manager for Monitor:Health
Plan
- Schedule simulation day - January 2021
- Identify the end-to-end workflow we will test in this game day. (See Google Doc)
- Record game day
- Identify improvements based on the game day and schedule them