Proposal: GitLab Developer First Responder program
Background
The current Dev Oncall system suffers from a number of issues:
- It is a mandatory system which many engineers are not enthusiastic about participating in.
- Complications due to geographic factors (some folks can't work weekends, nights, etc.)
- Lack of automation in incident escalation
- Lack of automation with scheduling
- Devs have little context when an escalation occurs.
Proposal (Updated)
Create a Developer First Responder oncall program.
This program will involve two components: 1. Weekday First Responder and 2. Weekend Oncall.
Weekday First Responder
- This will be a fully automated, bot-driven system
- The bot will escalate incidents by randomly selecting a BE who is currently online and eligible based on their working hours. The bot would also factor recency into the algorithm so that BEs who had recently been involved in an incident would be selected last.
- During incidents a randomly selected BE has the option to pass the incident to another BE if they are urgently needed somewhere else.
- All volunteers for Weekend Oncall will be exempted from the Weekday First Responder
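The selection rules above could be sketched roughly as follows. This is a minimal illustration, not the bot's actual implementation: the `Engineer` fields, the `pick_first_responder` name, and the 14-day `cooldown` are all assumptions made for the example, and working hours are assumed to be stored in UTC.

```python
import random
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional, Sequence


@dataclass
class Engineer:
    name: str
    shift_start: time                  # start of working hours (assumed UTC)
    shift_end: time                    # end of working hours (assumed UTC)
    online: bool = True
    weekend_volunteer: bool = False    # Weekend Oncall volunteers are exempt
    last_incident: Optional[datetime] = None


def in_working_hours(eng: Engineer, now: datetime) -> bool:
    t = now.time()
    if eng.shift_start <= eng.shift_end:
        return eng.shift_start <= t < eng.shift_end
    # Working hours wrap past midnight.
    return t >= eng.shift_start or t < eng.shift_end


def pick_first_responder(
    engineers: Sequence[Engineer],
    now: datetime,
    cooldown: timedelta = timedelta(days=14),  # hypothetical recency window
    rng=random,
) -> Optional[Engineer]:
    # Keep only BEs who are online, in working hours, and not exempt.
    eligible = [
        e for e in engineers
        if e.online and not e.weekend_volunteer and in_working_hours(e, now)
    ]
    if not eligible:
        return None
    # Prefer BEs with no incident inside the cooldown window.
    rested = [
        e for e in eligible
        if e.last_incident is None or now - e.last_incident >= cooldown
    ]
    if rested:
        return rng.choice(rested)
    # Everyone was recently involved: pick whoever was involved longest ago.
    return min(eligible, key=lambda e: e.last_incident)
```

Recently involved BEs are only picked when nobody else is eligible, which is one reading of "selected last"; a weighted random draw would be another reasonable interpretation.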
How incident escalation will look on a Weekday
- SRE et al. type /incident into #dev-escalation.
- A bot randomly selects a BE first responder based on working hours and whether they are online, and notifies them via Slack/cell etc.
- BE responds to bot thread with 👀.
- If the primary does not respond, a secondary will be notified.
- BE triages the issue and works toward a solution.
- If necessary, the BE reaches out to domain experts.
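The primary/secondary fallback in this flow might look something like the sketch below. The `escalate` function, its `notify` and `acknowledged` callbacks, and the 5-minute acknowledgement timeout are hypothetical; the proposal does not specify how the bot talks to Slack or how long it waits for the 👀 reaction.

```python
from typing import Callable, Optional, Sequence


def escalate(
    responders: Sequence[str],
    notify: Callable[[str], None],
    acknowledged: Callable[[str, float], bool],
    ack_timeout_s: float = 300.0,  # assumed wait before moving on
) -> Optional[str]:
    """Notify responders in order; return the first one who acknowledges.

    `notify` pages a responder (Slack, cell, ...). `acknowledged` blocks for
    up to `ack_timeout_s` seconds and reports whether that responder reacted
    to the bot thread. Both are stand-ins for real integrations.
    """
    for responder in responders:
        notify(responder)
        if acknowledged(responder, ack_timeout_s):
            return responder
    # Nobody acknowledged; the caller decides what happens next.
    return None


if __name__ == "__main__":
    paged = []
    owner = escalate(
        ["primary", "secondary"],
        notify=paged.append,
        acknowledged=lambda who, timeout: who == "secondary",
    )
    print(owner, paged)
```

Passing the integrations in as callbacks keeps the ordering logic testable without a live Slack workspace.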
Weekend Oncall
- This program provides coverage outside of normal company working hours
- We assume a "weekend" encompasses about 36 hours due to global TZ coverage.
- This is a volunteer-based system made up of those who are legally eligible to participate.
- Oncall rotation is perpetual and automatically scheduled.
- Volunteers can trade shifts with each other.
- All volunteers for Weekend Oncall will be exempted from the Weekday First Responder
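A perpetual, automatically scheduled rotation could be generated along these lines. This is only a sketch: the `weekend_shifts` name, the simple round-robin order, and the choice of weekend start time are assumptions. The 36-hour weekend comes from the list above, and the 4-hour shift length is borrowed from the Pilot section.

```python
from datetime import datetime, timedelta
from itertools import cycle, islice
from typing import Iterator, Sequence, Tuple

Shift = Tuple[datetime, datetime, str]


def weekend_shifts(
    volunteers: Sequence[str],
    first_start: datetime,      # start of the first covered weekend (assumed)
    weekend_hours: int = 36,
    shift_hours: int = 4,
) -> Iterator[Shift]:
    """Yield (start, end, volunteer) forever, cycling through volunteers."""
    if not volunteers:
        raise ValueError("at least one volunteer is required")
    rota = cycle(volunteers)
    shifts_per_weekend = weekend_hours // shift_hours
    weekend_start = first_start
    while True:
        for i in range(shifts_per_weekend):
            start = weekend_start + timedelta(hours=i * shift_hours)
            yield start, start + timedelta(hours=shift_hours), next(rota)
        weekend_start += timedelta(weeks=1)


if __name__ == "__main__":
    first = datetime(2024, 1, 6, 0, 0)  # a Saturday, chosen for illustration
    for start, end, who in islice(weekend_shifts(["ana", "bo", "chen"], first), 4):
        print(start, end, who)
```

Because the generator never ends, the schedule is perpetual; shift trades could be handled by swapping the volunteer on two already generated shifts.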
How incident escalation will look on a Weekend
- SRE et al. type /incident into #dev-escalation.
- A bot notifies the oncall volunteer via Slack/cell etc.
- BE responds to bot thread with 👀.
- If the primary does not respond, a secondary will be notified.
- BE triages the issue and works toward a solution.
- If necessary, the BE reaches out to domain experts.
Goals
- The goal here would be to identify individuals who are actually interested in being oncall and learning production engineering.
- This opens cross-training potential with the SRE team and allows Product Engineers to explore their interest in scalability etc.
- It also allows Product Devs with previous Infra or production engineering experience to continue expanding those skillsets (many folks within Product Engineering including myself are former SREs).
- The aim would be to make this an opportunity to learn new skills and contribute outside their primary product area at GitLab.
Pilot
I propose a pilot program, which would require a minimum of 50 volunteers to be successful. Each engineer would be expected to cover approximately four 4-hour shifts per month.
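As a rough check on the pilot numbers, the total commitment can be worked out directly. The figures for volunteers, shifts, and shift length come from the paragraph above, and the 36-hour weekend comes from the Weekend Oncall section. Averaging 52 weekends over 12 months is an assumption, and the proposal does not say how pilot hours would be split between weekdays and weekends.

```python
def monthly_volunteer_hours(volunteers: int, shifts_each: int, shift_hours: int) -> int:
    """Total shift hours the volunteer pool commits to per month."""
    return volunteers * shifts_each * shift_hours


def monthly_weekend_hours(weekend_hours: float, weekends_per_year: int = 52) -> float:
    """Average weekend hours to cover per month (assumes 52 weekends a year)."""
    return weekend_hours * weekends_per_year / 12


if __name__ == "__main__":
    committed = monthly_volunteer_hours(volunteers=50, shifts_each=4, shift_hours=4)
    weekend = monthly_weekend_hours(weekend_hours=36)
    print(f"committed volunteer hours per month: {committed}")
    print(f"average weekend hours per month: {weekend}")
```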
Edited by Nick Klick