Update emergency handling procedures to include exception criteria

GitLab Support: Process Change Rollout Plan

Update emergency handling to include exception criteria

The Story

@rspainhower Link to the updated handbook page, the deliverable from Issue #4537 (closed): https://about.gitlab.com/handbook/support/workflows/emergency_exception_workflow.html

The current public definition of emergency at https://about.gitlab.com/support/definitions/#severity-1 says:

Your instance of GitLab is unavailable or completely unusable. A GitLab server or cluster in production is not available, or is otherwise unusable.

While this is clear, it's so strict that it doesn't cover some situations where a company's misbehaving GitLab installation threatens business continuity. In contrast, our previous internal policy of "assume good intent" for incoming emergencies is too loose: handling every emergency page in real time is not sustainable.

This change attempts to find a middle ground: we maintain the strict definition in our terms, but set clearer criteria for when we should make an exception to those terms in the interest of the customer. To support initial triage, we also suggest starting the process asynchronously to gather enough information to make that determination.

To facilitate faster feedback, it also introduces tags that SEs and Support Managers should apply to each incoming emergency through the provided macros.

General::Emergency::Strict Definition - for situations that meet the strict "system down" definition
General::Emergency::Exception - for situations that fall under the exception criteria
General::Emergency::Needs more info - for situations that need more information (apply one of the two above once you have enough information)
General::Emergency::Not an Emergency - for situations that don't qualify as an emergency under either the strict definition or the exception criteria

See https://gitlab.com/gitlab-com/support/support-ops/zendesk-global/macros/-/merge_requests/498+ for content.
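
The choice between the four categories above can be sketched as a small decision function. This is only an illustration, assuming the triage outcome can be reduced to two yes/no/unknown answers. The function name, its parameters, and the order of the checks are hypothetical and are not taken from the macros or the handbook workflow; only the four category names come from this issue.

```python
from typing import Optional

# Category names as listed in this issue.
STRICT = "General::Emergency::Strict Definition"
EXCEPTION = "General::Emergency::Exception"
NEEDS_INFO = "General::Emergency::Needs more info"
NOT_EMERGENCY = "General::Emergency::Not an Emergency"


def choose_category(meets_strict: Optional[bool],
                    meets_exception: Optional[bool]) -> str:
    """Pick one of the four categories (hypothetical helper).

    Each argument is True, False, or None, where None means
    "not enough information yet to decide".
    """
    if meets_strict is True:
        return STRICT
    if meets_exception is True:
        return EXCEPTION
    # Nothing confirmed so far; if either question is still open,
    # ask for more information before ruling the emergency out.
    if meets_strict is None or meets_exception is None:
        return NEEDS_INFO
    return NOT_EMERGENCY
```

For example, `choose_category(False, None)` yields the "Needs more info" category, and the ticket would be re-categorized once the exception question is answered.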

The Roles

Role	Description
Champions	@lyle @abuerer @mdunninger
Users	Support Engineers, Customers
Impacted Non-Users	CSMs

Schedule

Support Manager Preparedness up until 2022-12-14
Support Engineer Preparedness up through 2022-12-26
Full adoption no later than 2022-12-26
Active evaluation weekly from implementation + 4 weeks
- Have emergencies been tagged?
  - Follow up with on-call managers
- Review tagged emergencies in managers' calls

Training

What do the users need to learn and how will they learn it? Do managers need to deliver training? Are there videos or tutorials or handbook pages or other materials?

Read through the MR
- Familiarize yourself with the new macros
- Read through the emergency exception workflow and look at the examples of the emergency / not emergency versions of a few situations
Take the quiz: https://forms.gle/GkBjjEs1BqkXQS9EA

Success Determination

Explain here how and what you will be monitoring to determine the success of the change. These are typical questions you might want to answer here:

What will success look like?

How will you track change adoption?

Is there a level of adoption that is required?

How will you measure success?

What are your targets (measured values that equate to success)?

Success will be a reduction in the number of synchronous calls required of on-call engineers and an increase in the number of high-priority cases handled by SGGs.

The measurable target for this change is that 100% of raised emergencies are tagged by an SE or Support Manager.
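
One way to check the tagging target during the weekly evaluation is to compute the share of emergency tickets that carry one of the four categories. The sketch below assumes the tickets have already been exported as dictionaries with `id` and `labels` keys; that shape, the function name, and the use of the category names as label strings are assumptions for illustration, not the actual Zendesk data format.

```python
from typing import Iterable, List, Tuple

# Category names from this issue, assumed here to appear as ticket labels.
EMERGENCY_LABELS = frozenset({
    "General::Emergency::Strict Definition",
    "General::Emergency::Exception",
    "General::Emergency::Needs more info",
    "General::Emergency::Not an Emergency",
})


def tagging_coverage(tickets: Iterable[dict]) -> Tuple[float, List[int]]:
    """Return (fraction of tickets tagged, ids of untagged tickets).

    Each ticket is a hypothetical dict: {"id": int, "labels": [str, ...]}.
    """
    tickets = list(tickets)
    untagged = [
        t["id"] for t in tickets
        if not EMERGENCY_LABELS.intersection(t.get("labels", []))
    ]
    if not tickets:
        # No emergencies raised in the period, so nothing was missed.
        return 1.0, untagged
    return (len(tickets) - len(untagged)) / len(tickets), untagged
```

A coverage of `1.0` corresponds to the 100% target; the list of untagged ticket ids gives a starting point for following up with the on-call managers.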

Action Plan

Follow-Up Plan

How will you follow-up to understand the results of the change, to make adjustments appropriately, and to rollback if necessary? These are typical questions you might want to answer here:

How will results be captured? By whom and by when?

What is the plan for considering and making quick improvements?

What is the plan should the change be deemed unsuccessful?

Is a rollback feasible, and if so how will it happen?

Edited Jan 19, 2023 by Rebecca Spainhower