Update emergency handling procedures to include exception criteria
GitLab Support: Process Change Rollout Plan
Update emergency handling to include exception criteria
The Story
@rspainhower Link to updated handbook page as deliverable from the Issue #4537 (closed): https://about.gitlab.com/handbook/support/workflows/emergency_exception_workflow.html
The current public definition of emergency at https://about.gitlab.com/support/definitions/#severity-1 says:
Your instance of GitLab is unavailable or completely unusable. A GitLab server or cluster in production is not available, or is otherwise unusable.
While this is clear, it's so strict that it doesn't cover some situations where a company's misbehaving GitLab installation threatens business continuity. In contrast, our previous internal policy of "assume good intent" for incoming emergencies is too loose: handling every emergency page in real time is not sustainable.
This change attempts to strike a middle road: we maintain the strict definition in our terms, but have clearer criteria for when we should make an exception to our terms in the interest of the customer. We also suggest starting triage asynchronously, so that we can gather enough information to make the determination.
To facilitate faster feedback, it also introduces tags that SEs and Support Managers should apply to each incoming emergency through the provided macros.
- `General::Emergency::Strict Definition` - for situations that meet the strict "system down" definition
- `General::Emergency::Exception` - for situations that fall under the exception criteria
- `General::Emergency::Needs more info` - for situations that need more information (you can apply one of the above once you have enough information)
- `General::Emergency::Not an Emergency` - for situations that don't qualify as an emergency under the strict definition or exception criteria
See https://gitlab.com/gitlab-com/support/support-ops/zendesk-global/macros/-/merge_requests/498+ for content.
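As a rough illustration of how the four tags relate to each other, the decision can be sketched as a small function. This is only a sketch under stated assumptions: `choose_tag` and its parameters are hypothetical names, the strings are the tag names as written above (the actual macro and tag values may differ), and the exception criteria themselves live in the emergency exception workflow, not in this code. `None` stands for "not yet known".

```python
from enum import Enum
from typing import Optional


class EmergencyTag(Enum):
    # Tag names as listed in this plan; the real Zendesk tag strings may differ.
    STRICT_DEFINITION = "General::Emergency::Strict Definition"
    EXCEPTION = "General::Emergency::Exception"
    NEEDS_MORE_INFO = "General::Emergency::Needs more info"
    NOT_AN_EMERGENCY = "General::Emergency::Not an Emergency"


def choose_tag(
    meets_strict_definition: Optional[bool],
    meets_exception_criteria: Optional[bool],
) -> EmergencyTag:
    """Hypothetical helper: map a triage assessment to one of the four tags.

    Each argument is True, False, or None (None = not enough information yet).
    """
    if meets_strict_definition is True:
        return EmergencyTag.STRICT_DEFINITION
    if meets_exception_criteria is True:
        return EmergencyTag.EXCEPTION
    if meets_strict_definition is None or meets_exception_criteria is None:
        # Something is still unknown: gather information, then re-tag.
        return EmergencyTag.NEEDS_MORE_INFO
    return EmergencyTag.NOT_AN_EMERGENCY


if __name__ == "__main__":
    # Instance is up, but it is unclear whether the exception criteria apply.
    print(choose_tag(False, None).value)
```

The ordering reflects the note on `Needs more info`: it is a holding state, replaced by one of the other tags once enough information is available.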
The Roles
Role | Description |
---|---|
Champions | @lyle @abuerer @mdunninger |
Users | Support Engineers, Customers |
Impacted Non-Users | CSMs |
Schedule
- Support Manager Preparedness up until 2022-12-14
- Support Engineer Preparedness up through 2022-12-26
- Full adoption no later than 2022-12-26
- Active evaluation weekly from implementation + 4 weeks
  - Have emergencies been tagged?
  - Follow up with on-call managers
  - Review tagged emergencies in managers' calls
Training
What do the users need to learn and how will they learn it? Do managers need to deliver training? Are there videos or tutorials or handbook pages or other materials?
- Read through the MR
- Familiarize yourself with the new macros
- Read through the emergency exception workflow and look at the examples of the emergency / not emergency versions of a few situations
- Take the quiz: https://forms.gle/GkBjjEs1BqkXQS9EA
Success Determination
Explain here how and what you will be monitoring to determine the success of the change. These are typical questions you might want to answer here:
- What will success look like?
- How will you track change adoption?
- Is there a level of adoption that is required?
- How will you measure success?
- What are your targets (measured values that equate to success)?
Success will be a reduction in the number of synchronous calls required of on-call engineers and an increase in the number of high-priority cases handled by SGGs.
The measurable target for this change is that 100% of raised emergencies are tagged by an SE or Manager.
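The tagging target could be checked during the weekly evaluation with something like the following. This is a minimal sketch, not an existing report: the ticket dictionaries, the `id` and `tags` keys, and the `tagging_coverage` function are all hypothetical, and the tag strings are the names used in this plan, not verified Zendesk tag values.

```python
# Tag names as written in this plan; actual Zendesk tag values may differ.
EMERGENCY_TAGS = frozenset({
    "General::Emergency::Strict Definition",
    "General::Emergency::Exception",
    "General::Emergency::Needs more info",
    "General::Emergency::Not an Emergency",
})


def tagging_coverage(emergencies):
    """Return (coverage, untagged_ids) for a list of emergency tickets.

    Each ticket is assumed to be a dict with an "id" and a list of "tags".
    Coverage is None when there were no emergencies in the period.
    """
    untagged_ids = [
        ticket["id"]
        for ticket in emergencies
        if not EMERGENCY_TAGS & set(ticket["tags"])
    ]
    if not emergencies:
        return None, untagged_ids
    coverage = (len(emergencies) - len(untagged_ids)) / len(emergencies)
    return coverage, untagged_ids


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        {"id": 101, "tags": ["General::Emergency::Strict Definition"]},
        {"id": 102, "tags": ["General::Emergency::Exception"]},
        {"id": 103, "tags": ["General::Emergency::Not an Emergency"]},
        {"id": 104, "tags": []},
    ]
    coverage, untagged = tagging_coverage(sample)
    print(coverage, untagged)  # 0.75 [104]
```

The list of untagged ticket IDs gives on-call managers a concrete set of emergencies to follow up on.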
Action Plan
- Announce the change and include The Story in the SWIR on 2022-10-13 (https://gitlab.com/gitlab-com/support/readiness/support-week-in-review/-/issues/272)
- Post a message in the `#support_team-chat` Slack channel (or other support channel as appropriate) announcing the change and pointing to the SWIR announcement on 2022-10-13: https://gitlab.slack.com/archives/CCBJYEWAW/p1665695531570349
- Announce the change and tell The Story in Team meetings by date
  - EMEA team meeting
  - AMER team meeting
  - APAC team meeting
- Other communications channels
  - Discuss in 1-1s, telling The Story, by date
  - Other communications channels, if required - for example, post to a TAM channel if the TAMs will be impacted non-users
- Report back on change adoption, concerns, etc. by date
Follow-Up Plan
How will you follow-up to understand the results of the change, to make adjustments appropriately, and to rollback if necessary? These are typical questions you might want to answer here:
- How will results be captured? By whom and by when?
- What is the plan for considering and making quick improvements?
- What is the plan should the change be deemed unsuccessful?
- Is a rollback feasible, and if so how will it happen?