How to help engineers evaluate business risk during emergencies
Request for comments
Need
Our current definition of Emergency is:
GitLab instance in production is unavailable or completely unusable
This definition causes strain in two areas:
- Many emergencies that are raised do not strictly match this definition:
  - Losing access to Premium features does not make a production instance unavailable or unusable.
  - Losing access to a fleet of Runners does not make a GitLab instance unavailable or unusable.
- Support Engineers on-call are caught between our definition of emergency and the "assume good intent" section of the guide to customer emergencies on-call.
The result is that we either aggressively out-of-scope non-emergency pages or spend many engineer hours on sync calls that could be handled async or in a scheduled synchronous call.
gitlab-com/marketing/digital-experience/buyer-experience!592 took some steps toward clarifying which situations might be emergencies and which might not. However, these lists can never be exhaustive, and the longer they get, the more often they're used as a knife to cut customers out of emergency support rather than a ruler we come together around to measure the impact of a situation and respond appropriately.
Approach
- Reframe our definition of emergency to be broad enough to cover cases of high business impact.
- Make sure Engineers and Managers have the tools and training needed to care for customers at an appropriate level of severity if there is a mismatch.
- Iterate on any processes which are required to support them.
Case studies / Hypotheticals:
- Knight Capital had a production bug that cost them $10 million/minute. If they had been using GitLab at the time of this incident, using GitLab CI for production deployments, and their runners were down - would we have treated their case as an emergency?
- If you're a customer operating in a highly regulated environment where you must be able to trace every MR to an author and an approver, losing access to Merge Request approvals may mean weeks of work identifying all MRs merged while the feature was unavailable to ensure that the relevant approvers were present. The immediate business impact is negligible - GitLab would be operating and code could be merged - but reprioritizing effort in the coming weeks would be a significant cost to daily operations. Would we treat such a case as an emergency?
- Others??
Open questions:
- If we broaden our definition of emergency, will we see an increase in cases as a result?
- How do we serve customers well while not overburdening engineers on-call?
- If something doesn't qualify as an emergency, what does taking care of the customer look like?
- To what extent can we prevent emergencies? Are there actions we can take sooner to increase priority/attention to avoid a page?
- What would be useful to you as a manager or engineer? What is the right role for the Customer Emergencies On-Call to play in an emergency - do we have it right now, or does it need to change to support business impact as a measure of "emergency"?
Comparisons:
- AWS Support - Business/Mission critical system down
- Atlassian Support - "production application down or major malfunction causing business revenue loss or high numbers of staff unable to perform their normal functions."
- GitHub Support splits it out for SaaS/Self-Managed:
  - GitHub Enterprise Cloud - "Production workflows for your organization or enterprise on GitHub Enterprise Cloud are failing due to critical service errors or outages, and the failure directly impacts the operation of your business."
  - GitHub Enterprise Server - "GitHub Enterprise Server is failing in a production environment, and the failure directly impacts the operation of your business."
What I'm looking for in this RFC
- Ideas - I can't promise popular ideas will result in a Requested Change, but all input will be considered.
- Personal stories - do you have experiences from past companies?
- Examples - Have you out-of-scoped an emergency and taken care of the customer's needs async or in a scheduled sync call?
- Data (or ideas of data to get) to help inform our approach
Edited by Greg Myers