Development on-call process for GitLab.com infrastructure escalations
The Development team needs to implement an on-call process and rotation schedule to support the Infrastructure team in resolving operational escalations. Things to sort out:
The process:
- Who is the on-call engineer at the moment?
- How to reach the on-call engineer?
- What is the SLO for first response?
- What if the on-call engineer lacks domain expertise for a particular incident?
- What is the scope of this process? (GitLab.com and/or self-managed)
Rotation schedule:
- What's the guideline to optimize availability across time zones?
- Who is eligible to be an on-call engineer?
- How long should an engineer be on duty in one rotation?
- What is the length of one rotation cycle?
- Who nominates candidates to fill the slots?
- Where is the rotation schedule published?
- Should there be an escalation rotation as well (Director/Manager)?
@glopezfernandez Please respond with feedback and requirements.
Related information:
- https://docs.google.com/document/d/1GfwzPc1uavB5ZuiA9O8l5sUD7Sq2l2BlwevM5AvOtZA/edit
- https://gitlab.com/groups/gitlab-org/-/boards/1193197?&label_name[]=gitlab.com&label_name[]=infradev
Edited by Chun Du