Second-tier oncall pilot - Gitaly
Tracking issue / single point of entry for various things.
## Background
Currently, Developer Oncall (the escalation point for expert advice in emergencies) is a single engineer who usually lacks deep knowledge of the area affected by the emergency at hand. This makes the Developer Oncall function of limited use to the organization and wastes engineers' time.
Team members in dire need of expert advice often fall back to pinging several people at once on Slack, or even reaching out through phone calls or SMS. This is not guaranteed to _find_ any help, and it is costly in interruptions to the (many) recipients.
We want to create a _more focused_ escalation path so that expert help can be had in a deterministic and simple way.
## Project
Create an improved dev oncall experience with a smaller scope. Pilot this in Gitaly.
From early discussions:
- create a second-tier PD escalation schedule that lets responders page specific owners. Example: an SRE responding to a page needs help from the DB or Gitaly team and uses a PD trigger to page someone on that team.
- The paged engineer would have an expected response time of 15-30 minutes
- The expectation is to not have a weekend schedule. This is to be able to iterate quickly without having to change contracts, sort out legal and compensation issues etc.
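
As an illustration of the "PD trigger" idea above, here is a minimal sketch of how a responder-side tool might page a second-tier owner through the PagerDuty Events API v2. It is not the design from this issue: the function names, the weekday check, and the use of UTC are assumptions made for the example, and the routing key is a placeholder for whatever integration key the Gitaly service would have.

```python
import json
import urllib.request
from datetime import datetime, timezone

# PagerDuty Events API v2 endpoint for enqueueing events.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def is_covered(now: datetime) -> bool:
    """Assumption: coverage is Monday-Friday, since the pilot has no weekend schedule.

    Which timezone defines "weekend" is not specified in this issue;
    the caller decides by passing a timezone-aware datetime.
    """
    return now.weekday() < 5  # Monday is 0, Saturday is 5


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str = "critical") -> dict:
    """Build an Events API v2 'trigger' event for the given integration key."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


def send_event(event: dict) -> int:
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    if is_covered(now):
        # Hypothetical routing key and summary, for illustration only.
        event = build_trigger_event(
            routing_key="GITALY_TIER2_ROUTING_KEY",
            summary="SRE requests Gitaly expert help",
            source="sre-oncall",
        )
        print(json.dumps(event, indent=2))  # call send_event(event) to page
    else:
        print("Outside second-tier coverage hours; use the existing escalation path.")
```

The weekday guard only mirrors the "no weekend schedule" expectation on the caller's side; in practice the PagerDuty schedule itself would determine who, if anyone, is paged.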
## Status
- [x] early feasibility study complete
- [x] explore PagerDuty mechanics and cost
- [x] ^ explore alternatives if needed
- [x] **formulate a concrete proposal and get buy-in from Infrastructure and Product** -> https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/6279
- [x] announce the change and timeline
- [ ] set up necessary trainings
- [x] implement the logistics
- [ ] collect feedback 1, 3, 6 months after launching
## Implementation
tentative design: https://docs.google.com/document/d/14d_uGPBWlKSkzeaNwJLrUiahlU-Q_EiRrt14qjS25iU/edit
some code for feasibility studies: https://gitlab.com/andrashorvath/goon/-/tree/main
## Timeline
Specifically for Gitaly, which is to be the first attempt at carving out a smaller scope, we need enough oncallers to make this feasible. With new people joining this summer, the tentative start is January 2025. That should also allow time for all necessary approvals, setting up logistics, and completing the training needed.