Document criteria for potential exceptions to our definition of emergency
Problem Statement
What is the problem?
The current public definition of emergency at https://about.gitlab.com/support/definitions/#severity-1 says:
Your instance of GitLab is unavailable or completely unusable. A GitLab server or cluster in production is not available, or is otherwise unusable.
While this definition is clear, in practice we rarely downgrade emergencies, even though many of them do not meet these criteria.
@rspainhower Link to the updated handbook page delivered from this issue: https://about.gitlab.com/handbook/support/workflows/emergency_exception_workflow.html
Why is this a problem?
In #4392 (closed) we discussed why integrating business impact into our definitions would allow us to handle issues that don't strictly meet the above definition, while still giving us a set of criteria to judge situations against.
The root of the problem is that our definition of emergency is so strict that it's practically unenforceable. While strict enforcement would immediately reduce the amount of real-time work our on-call engineers face, the consequences for GitLab's business would be severe. With no documented exception process, there would be intense internal pressure to treat any emergency as an exception whenever the customer has an advocate inside of GitLab.
With our current definition we have the options of:
- ignoring the strict definition and operating on a set of undocumented conventions for incoming emergencies.
- strictly adhering to the public definition and risking significant impact for our customers and GitLab.
Proposal
History and Context
The immediate proposal that came out of #4392 (closed) was to adopt a new definition of emergency in our Support Definitions that takes business impact into account:
Severity 1 - Emergency
You are experiencing a problem with GitLab on one or more of your production GitLab servers or clusters. You have no reasonable workaround. And at least one of the following is true:
1. The problem will create a significant financial impact to your company, either immediately or as an after-effect.
2. The problem is rendering GitLab essentially useless for your business purposes:
   1. One or more GitLab servers or services will not run, or
   2. GitLab features vital to your business's daily operations are unavailable due to a license problem, or
   3. Your GitLab instance is suffering a significant and persistent performance degradation, or
   4. Basic operations such as clone, push, or login are not working
3. The problem will create immediate legal problems for your business
4. The problem will create immediate audit problems for your business
Directionally, I think this is the right way to go. These criteria cover more cases while remaining strict enough that we can downgrade an emergency in a defensible way that doesn't hinge on technical correctness.
However, integrating this language into our contract terms means that if we want to tighten these definitions in the future, we'll have to either require all customers to agree to the new terms or honor the old terms until each customer's next renewal.
In short, integrating these directly into our terms makes this decision very difficult to undo.
The actual proposal
In this issue I'm proposing that we:
- retain the current restrictive terms defined in our statement of support
- introduce the expanded definitions of emergency into the CEOC workflow as a documented set of conditions under which we might consider an exception
- revise the "assume good intent" language in the CEOC workflow to point to the "Business Impact" criteria
Rationale (why)
- Having a looser set of conventions backed by a stricter set of contractual terms allows us to observe and iterate on our definitions of business impact.
- Documenting a set of conventions based on business impact is better than the alternatives of strict adherence or ignoring our current definition of Severity 1.
- Having a set of exceptions to a strict policy lets Support Engineers be the "hero" instead of the "villain": "Based on the business impact of this situation, I've gotten approval to treat this case as an Emergency."
DRI
@lyle will act as the DRI for this issue.
Required Resources
- Training materials for SEs and Managers on using the business impact definitions
- Completed training for SEs on-call after implementation
- Monitoring, evaluation, and feedback issues for judging whether the business impact definitions are an effective tool for reducing synchronous time spent on emergency tickets.
- MR for on-call page to update "best intent" in favor of business impact
Potential Roadblocks/Things to consider
- This iteration on the definitions of business impact may prove ineffective: all current emergencies qualify for an exception and there is no reduction in engineer effort spent on customer emergencies.
- Support Managers and Engineers may continue to operate under "best intent" and not take the opportunity to reset severity on emergencies that wouldn't qualify for an exception.
- I may be overestimating the impact of strict adherence to our current definition of S1 / Emergency.
Desired Outcome
What does success look like?
I would consider this change successful if we saw a 10% reduction in synchronous calls handled by the on-call engineer.
How do we measure success?
For 30 days, using the lightweight retrospective tracker introduced in #4211, measure the number of emergencies taken to a sync call and the number reset to a more appropriate severity.
Where would future feedback go?
Immediate feedback can go in this issue; future feedback will have a dedicated issue.
Related Issues/MRs/Epics/Tickets
- RFC: #4392 (closed)
- Manager handbook updates: gitlab-com/www-gitlab-com!109463 (closed)
- Draft MR to add business impact definitions to Definitions: gitlab-com/marketing/digital-experience/buyer-experience!923 (closed) (suggesting we close this one)