FY21Q1 OKR (Lyle) - Staff on-call rotations => 80%
-
100% of former agents have completed the CMOC bootcamp -
100% of former agents are participating in the CMOC emergency rotation -
n% of former self-managed engineers are participating in CMOC emergency rotation -
100% of support engineers are scheduled into an emergency rotation within 8 weeks of being onboarded. -
Opened support-training!325 (merged) to address this
-
Reduce the variable cost of staffing the on-call rotation to a fixed cost by developing a quarterly staffing model. -
Run 3 mock GitLab.com incidents (with Dave Smith) (per region) -
Incident Practice 1: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7964 -
Incident Practice 2: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9714 -
Incident Practice 3: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9792 - Make sure to practice the hand-off procedure.
-
OKR Scoring: 80%
Good:
- got CMOC rolled out globally
- ran practice incidents in every region, and plan to continue them monthly with the infra team
- made onboarding the same for both on-call rotations
Bad:
- Reducing cost is logistically complex (but I found we're paying much less than I had thought). Continuing this thread of thought in #2342 (closed) where we investigate how on-call rotations interact with each other.
Try:
- It might be worth formalizing the "go live" issues for things that require specialized training. It seems to have worked for launching, but might have gone more smoothly if it were something we'd done before.
Edited by Lyle Kozloff