Facilitate Infrastructure team's gap analysis of Incident Management => 100%
Purpose
The purpose of this issue is to capture the results of a gap analysis that identifies which features and functionality GitLab Incident Management is missing, so that the GitLab SRE team can use it entirely in place of PagerDuty to maintain GitLab.com.
Outcome
Completed Gap Analysis: https://gitlab.com/gitlab-org/monitor/health/-/issues/36
Background
The GitLab SRE team and the Monitor:Health team share the ambitious goal of having the GitLab SRE team dogfood Incident Management entirely by the end of FY21Q4. This requires an understanding of the features and functionality that GitLab's current Incident Management offering is missing, so that the GitLab SRE team can transition smoothly and fully.
Disclaimer: The GitLab SRE team will never be asked to dogfood a tool that makes their jobs more difficult or degrades their performance. It is the mission of the Monitor:Health team to build an Incident Management platform that is satisfactory and supports them fully in their responsibilities.
Plan
- @sarahwaldner and @crystalpoole will observe the SREs and take notes during the 2nd simulation day.
- @crystalpoole will prepare the gap analysis and circulate it for review by the SRE team.
- @sarahwaldner will create and schedule issues for building the missing functionality.
Notes
Google doc with notes from simulation day
Status
2020-10-06
The gap analysis has been drafted and is awaiting a final review by the Infrastructure group. The outcome is a roadmap in which SRE production alerts can be integrated within 2 milestones and a full cutover of GitLab's incident management process could be achieved post %13.9.
Updating KR scoring to 95%.
2020-09-04
Simulation day went ahead as planned and was a major success. The team went through a full incident response process using a recent incident as the basis for the simulation. This will provide the basis for a gap analysis, and it demonstrated a high level of collaboration and buy-in for this dogfooding effort. Great job team!
- Video of simulation day: https://drive.google.com/file/d/1_gOTWIy_cHp_ofam8bjofEn7Mm-lWETv/view
- Detailed notes from simulation day: https://docs.google.com/document/d/1IFRxwU2xsce0yWVcH_RalmnG02HNZ5H3Xv3sGQsevGA/edit
Updating KR scoring to 50%
2020-08-20
Simulation day is scheduled for 2020-09-02 and planning of the scenario/simulation has begun. A plan is in place to author the gap analysis and we anticipate being able to complete it in the 3 weeks following the simulation day. Updating scoring to 30%.
Retrospection
Good
- Simulation days resulted in a shared plan and expected timeline for fully adopting GitLab Incident Management.
- Infrastructure team seems highly bought into dogfooding and adopting this feature set.
- GMAU is now instrumented and increasing quickly.
- Product was heavily involved in development of the gap analysis and that led to stronger alignment between Engineering and Product on the mid-to-long term plans for incident management.
Bad
- Still clarifying some details of sequencing and prioritization between Infrastructure team needs and features needed for a viable offering (e.g. on-call schedule management, notification).
- Monitor team realignment caused some disruption to the team, though there was no significant impact on velocity.
- Little visibility into which Ultimate customers are using Incident Management features.
- This was an engineering KR and also part of our product direction, so it wasn't clear who should own this effort. It worked out well as a joint effort between Product, Engineering and Infrastructure.
Try
- Improve granularity of team *MAU metrics to better understand adoption and existing customer behavior.
- Clarify strategy for balancing features that directly benefit the GitLab Infrastructure team (e.g. Incident Issue type) against features needed to compete with other products (e.g. schedule management, paging).
- Evangelize dogfooding and external use cases now that usage is growing.