Lightweight Retrospectives for PagerDuty Paged events
Request for comments
Need
Engineers on call for Customer Emergencies, US Fed Customer Emergencies and CMOC are paged into situations that are often both high-urgency and high-importance. At present we spend little time identifying trends or finding improvements that might prevent similar emergencies in the future.
Some examples are:
- In #3114 (closed), @greg attempted to draw correlations between upgrades and emergencies. Because tagging relied on on-call engineers completing a specific post-action, many tickets were not tagged.
- In a few cases, customers began to rely on emergency support for common issues and generated a number of emergencies far out of proportion to other customers of similar size.
Approach
For each pageable event:
- create a lightweight retrospective issue and assign it to the IC, the manager and any other participants in the emergency.
- sort the retrospective issues into meaningful categories, for example: LR, Out of Scope Request, Upgrade, High ARR
- identify docs updates, bugs or process improvements that might have prevented this emergency
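The steps above could be automated roughly as follows. This is a minimal sketch, not a proposed implementation: `PagedEvent` and all of its fields are hypothetical, and the returned dictionary only assumes the general shape of a GitLab "create issue" request (`title`, `description`, comma-separated `labels`, `assignee_ids`). Nothing here calls PagerDuty or GitLab.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PagedEvent:
    """Hypothetical record of one page; these fields are assumptions, not a PagerDuty schema."""
    incident_id: str
    ticket_url: str
    summary: str
    participant_ids: list[int]  # IC, manager and any other participants
    categories: list[str] = field(default_factory=list)  # e.g. "Upgrade", "High ARR"


def build_retro_issue(event: PagedEvent) -> dict:
    """Build the payload for exactly one retrospective issue per paged event."""
    description = "\n".join([
        f"Emergency ticket: {event.ticket_url}",
        "",
        "What might have prevented this emergency?",
        "- [ ] Docs update:",
        "- [ ] Bug:",
        "- [ ] Process improvement:",
    ])
    return {
        "title": f"Retro: {event.summary} ({event.incident_id})",
        "description": description,
        # De-duplicate and sort so the same categories always give the same label string.
        "labels": ",".join(sorted(set(event.categories))),
        # De-duplicate assignees while keeping the order they were listed in.
        "assignee_ids": list(dict.fromkeys(event.participant_ids)),
    }
```

Keeping the payload builder separate from whatever sends it means the one-issue-per-event rule and the description template can be reviewed and tested without touching either service.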
Benefit
- With a one-to-one mapping between retro issues and pageable events, we have a set of issues we can use to analyze on-call load and detect trends
- Managers have a clearer view of how often their engineers are paged
- Process improvements, docs MRs and bugs are tied directly to emergency tickets, which could be used in prioritization
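To illustrate the kind of trend analysis this would enable, here is a small sketch that counts retro issues per label and flags customers with repeated emergencies. It assumes each retro has been exported as a dictionary with a `labels` list and a `customer` field; both the export format and the `customer` field are hypothetical.

```python
from collections import Counter


def emergencies_per_label(retros):
    """Count how many retro issues carry each label (each issue counted once per label)."""
    counts = Counter()
    for retro in retros:
        counts.update(set(retro["labels"]))
    return counts


def frequent_customers(retros, threshold):
    """Return customers with at least `threshold` retro issues, sorted by name."""
    per_customer = Counter(retro["customer"] for retro in retros)
    return sorted(name for name, n in per_customer.items() if n >= threshold)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = [
        {"customer": "A", "labels": ["Upgrade", "High ARR"]},
        {"customer": "A", "labels": ["Upgrade"]},
        {"customer": "B", "labels": ["Out of Scope Request"]},
    ]
    print(emergencies_per_label(sample).most_common())
    print(frequent_customers(sample, threshold=2))
```

Because every paged event maps to exactly one retro issue, these counts can stand in for on-call load without depending on engineers tagging the original tickets.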
Competition / Alternatives
- Do nothing
Labels that might be useful
- ~"Low ARR" - customers with fewer than 50 seats
- ~"Failed Upgrade" - may want to convert to a scoped label ~"Emergency Type::Failed Upgrade" when/if we have more common reasons.
- ~"Platform::Self-Managed" and ~"Platform::SaaS"
- ~"Incident-Comms::Status-Page" for things that make it to the status page
- ~"Multiple Emergencies" when a retro contains multiple paged events
Edited by Lyle Kozloff