Monthly review of Incident and Pager trends
Summary
There are a number of mechanisms for regularly reviewing individual incidents beyond @jarv's monthly availability highlights (2023-02), but none for reviewing pager load or incident trends. Most trend work is reactive: it starts when an individual perceives higher-than-normal incident or pager load, and the effort is disbanded once its goal is achieved, as with the Engineer Quality of Life Squad.
To maintain our greatly improved availability key result, we should be proactive about incident and pager load trends.
The process should:
- Be async, in line with our values; another meeting is not needed.
- Be open to anyone at GitLab but compulsory for the Engineers who participate in the Engineer on Call pool.
- Focus on identifying trends in incidents and pager load.
- Leverage the Monthly Availability Highlights report (2023-02) as the starting point for incidents.
- Leverage the Monthly Availability Highlights report (2023-02) and Alert Analytics dashboard as the starting point for pager load.
- Raise the appropriate InfraDev, Corrective Action or Reliability Improvement Issue for the owning team to execute.
- Feed data to the Reliability Engineering Leadership Team, to be reviewed monthly for effectiveness in the Reliability Leadership Meeting.
- Have the Sr. Director, Infrastructure for Reliability as its DRI, transitioning to a team dedicated to managing reliability frameworks when one is established.
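As an illustration of what "identifying trends" in pager load could mean in practice, here is a minimal sketch. It assumes monthly page counts have been exported by hand (for example from the Alert Analytics dashboard); the function name `flag_trend`, the three-month window, and the 25% threshold are all hypothetical choices and not part of any existing tooling.

```python
from statistics import mean


def flag_trend(monthly_counts, window=3, threshold=1.25):
    """Return True when the latest month exceeds the trailing average.

    monthly_counts: page (or incident) counts per month, oldest first.
    window: number of preceding months used as the baseline (assumed value).
    threshold: multiplier over the baseline that counts as a trend (assumed value).
    """
    # Not enough history to form a baseline: report no trend.
    if len(monthly_counts) < window + 1:
        return False
    *history, latest = monthly_counts
    baseline = mean(history[-window:])
    return baseline > 0 and latest > baseline * threshold


# Hypothetical page counts for four consecutive months, oldest first.
pages = [40, 42, 38, 61]
print(flag_trend(pages))  # baseline is 40, and 61 > 50, so this prints True
```

A check like this would only surface candidates for discussion; deciding whether to raise an InfraDev, Corrective Action or Reliability Improvement Issue remains a judgment for the reviewers.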
Next steps
- Solicit feedback on approach. DRI: @alanrichards
- Create process and dry run on 2023-07 highlights. DRI: @afappiano
- Create/Update required handbook pages. DRI: @afappiano
- Run documented process on 2023-08 availability report. DRI: @afappiano
Edited by Anthony Fappiano