Critical customer escalation process
Background
We want to define a clearer process and a definitive response for handling critical customer escalations. Currently, customer escalations that require an engineering response are handled with some variation of the following:
- Issues are worked on immediately
- Updates are posted in the issue, the MR, and Slack
- Priority labels are used inconsistently for bugs or features
Proposal
The proposal below has been reviewed and merged into the handbook: https://about.gitlab.com/handbook/engineering/#critical-customer-escalations
Critical Customer Escalations
We follow the process below when existing critical customer escalations require immediate scheduling of bug fixes or development effort.
Requirements for critical escalation
- Customer is in critical escalation state
- The issues escalated have critical business impact to the customer, determined by Customer Success and Support Engineering leadership
- Failure to expedite scheduling may have cascading business impact to GitLab
- Approval from both a VP of Customer Success AND a Director of Support Engineering is required to expedite scheduling
  - Customer Success: approval from either Sherrod Patching or David Sakamoto
  - Support Engineering: approval from any one of Lee Matos, Lyle Kozloff, Shaun McCann, or Val Parsons
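The approval rule above (one approver from each group) can be sketched as a small check. This is only an illustration: the approver names are copied from the list above, while the function name `is_expedite_approved` and the set constants are hypothetical and not part of any GitLab tooling.

```python
# Approver names are copied from the requirements list; the constants and
# the function below are hypothetical names used only for illustration.
CUSTOMER_SUCCESS_APPROVERS = {"Sherrod Patching", "David Sakamoto"}
SUPPORT_ENGINEERING_APPROVERS = {
    "Lee Matos",
    "Lyle Kozloff",
    "Shaun McCann",
    "Val Parsons",
}


def is_expedite_approved(approvals):
    """Return True when at least one approver from each group has signed off."""
    approvals = set(approvals)
    has_customer_success = bool(approvals & CUSTOMER_SUCCESS_APPROVERS)
    has_support_engineering = bool(approvals & SUPPORT_ENGINEERING_APPROVERS)
    # Both groups are required: the rule is an AND, not an OR.
    return has_customer_success and has_support_engineering
```

For example, `is_expedite_approved(["David Sakamoto", "Val Parsons"])` is true, while two approvals from the same group are not sufficient.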
Process
- The issue priority is set to ~"priority::1" regardless of severity
- The label ~"critical-customer-escalation" is applied to the issue
- The issue is scheduled within 1 business day by the DRI
- The DRI provides daily process updates in the escalated customer Slack channel
DRI
- If the issue is of type bug, the DRI is the Director of Development
- If the issue is of type feature, the DRI is the Director of Product
The DRI can use the customer critical merge requests process to expedite code review & merge.
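The labeling and DRI rules above can be sketched in a few lines. This is a minimal illustration, not an existing automation: the `Issue` class and the `escalate` function are hypothetical, and only the two label names and the DRI roles come from the process described above.

```python
from dataclasses import dataclass, field

# Label names and DRI roles come from the process above.
PRIORITY_LABEL = "priority::1"
ESCALATION_LABEL = "critical-customer-escalation"
DRI_BY_ISSUE_TYPE = {
    "bug": "Director of Development",
    "feature": "Director of Product",
}


@dataclass
class Issue:
    """Hypothetical stand-in for a GitLab issue, for illustration only."""
    title: str
    issue_type: str  # "bug" or "feature"
    labels: set = field(default_factory=set)


def escalate(issue):
    """Apply the escalation labels and return the DRI role for the issue."""
    if issue.issue_type not in DRI_BY_ISSUE_TYPE:
        raise ValueError(f"unsupported issue type: {issue.issue_type!r}")
    # Scoped labels (scope::value) are mutually exclusive within a scope,
    # so any existing priority label is dropped before priority::1 is set.
    issue.labels = {l for l in issue.labels if not l.startswith("priority::")}
    issue.labels.update({PRIORITY_LABEL, ESCALATION_LABEL})
    return DRI_BY_ISSUE_TYPE[issue.issue_type]
```

Calling `escalate` on a bug that carries `priority::3` replaces that label with `priority::1`, adds `critical-customer-escalation`, and returns "Director of Development". Scheduling within 1 business day and the daily updates remain manual steps for the DRI.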
Tasks
- Propose and review the process in this issue
- MR updated to the handbook !121910 (merged)
- Communicate updated process
  - engineering-fyi Slack channel
  - engineering week in review document
  - escalated customer Slack channel
Prior context
We looked into introducing a Priority 0 earlier but this was a big change that would require a broad update to our security vulnerability response time. The above is a smaller iteration which focuses on addressing critical customer escalations without a disruption to Availability & Vulnerability.
We will continue iterating on implementing a P0 at a later date, but this would require:
- An update to the security vulnerability severity SLO, which would be better done post-FedRAMP
- An update to the availability severity SLO
- An improvement in our delivery processes so that we can guarantee a 12-hour deployment time