Critical customer escalation process

Background

We want to define a clearer process and a definitive response for handling critical customer escalations. Currently, customer escalations that require an engineering response are handled with some variation of the following:

Issues are worked on immediately
Updates happen in the issue, the MR, and Slack
Priority labels are used inconsistently for bugs or features

Proposal

The proposal below has been reviewed and merged into the handbook: https://about.gitlab.com/handbook/engineering/#critical-customer-escalations

Critical Customer Escalations

We follow the process below when an existing critical customer escalation requires immediate scheduling of bug fixes or development effort.

Requirements for critical escalation

Customer is in critical escalation state
The escalated issues have a critical business impact on the customer, as determined by Customer Success and Support Engineering leadership
- Failure to expedite scheduling may have a cascading business impact on GitLab
Approval from a VP of Customer Success AND a Director of Support Engineering is required to expedite scheduling
- Customer Success: approval from either Sherrod Patching or David Sakamoto
- Support Engineering: approval from any one of Lee Matos, Lyle Kozloff, Shaun McCann, or Val Parsons
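
The requirements above amount to a simple AND of four conditions. The following is a minimal sketch of that check, not an existing tool: the `EscalationRequest` class, its field names, and `can_expedite` are hypothetical, and only the approver names come from the list above.

```python
from dataclasses import dataclass

# Approver names are taken from the requirements above.
CUSTOMER_SUCCESS_APPROVERS = {"Sherrod Patching", "David Sakamoto"}
SUPPORT_ENGINEERING_APPROVERS = {
    "Lee Matos", "Lyle Kozloff", "Shaun McCann", "Val Parsons",
}


@dataclass
class EscalationRequest:
    """Hypothetical record of an escalation request (illustration only)."""
    customer_in_critical_escalation: bool
    critical_business_impact: bool
    approvers: frozenset = frozenset()


def can_expedite(request: EscalationRequest) -> bool:
    """Return True only when every requirement for expedited scheduling is met."""
    # One approver from each group is enough, but both groups must approve.
    has_cs_approval = bool(request.approvers & CUSTOMER_SUCCESS_APPROVERS)
    has_support_approval = bool(request.approvers & SUPPORT_ENGINEERING_APPROVERS)
    return (
        request.customer_in_critical_escalation
        and request.critical_business_impact
        and has_cs_approval
        and has_support_approval
    )
```

For example, a request approved only by Customer Success does not qualify until a Support Engineering approver also signs off.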

Process

The issue priority is set to ~"priority::1" regardless of severity
The label ~"critical-customer-escalation" is applied to the issue
The issue is scheduled within 1 business day by the DRI
The DRI provides daily process updates in the escalated customer's Slack channel
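
As an illustration of the first three steps, the sketch below builds a GitLab `/label` quick action for the two labels and computes a scheduling deadline. It rests on assumptions the process does not state: the one business day is counted from the approval date, business days are Monday to Friday, and holidays are ignored. The function names are hypothetical.

```python
from datetime import date, timedelta

# Labels named in the process above.
ESCALATION_LABELS = ("priority::1", "critical-customer-escalation")


def label_quick_action(labels=ESCALATION_LABELS) -> str:
    """Build a /label quick action that applies every escalation label."""
    return "/label " + " ".join(f'~"{name}"' for name in labels)


def scheduling_deadline(approved_on: date) -> date:
    """Return the next business day after approval.

    Assumption: Monday to Friday are business days; holidays are not handled.
    """
    deadline = approved_on + timedelta(days=1)
    while deadline.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        deadline += timedelta(days=1)
    return deadline
```

Under these assumptions, an escalation approved on a Friday would need to be scheduled by the following Monday.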

DRI

If the issue is a bug, the DRI is the Director of Development
If the issue is a feature, the DRI is the Director of Product
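
The DRI rule is a direct lookup by issue type. A minimal sketch, assuming issue types are passed as the plain strings "bug" and "feature" (the function and mapping names are hypothetical):

```python
# Mapping taken from the DRI rules above.
DRI_BY_ISSUE_TYPE = {
    "bug": "Director of Development",
    "feature": "Director of Product",
}


def dri_for(issue_type: str) -> str:
    """Return the DRI role for an escalated issue of the given type."""
    key = issue_type.strip().lower()
    if key not in DRI_BY_ISSUE_TYPE:
        # The process only defines a DRI for bugs and features.
        raise ValueError(f"No DRI defined for issue type: {issue_type!r}")
    return DRI_BY_ISSUE_TYPE[key]
```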

The DRI can use the customer critical merge requests process to expedite code review & merge.

Tasks

Propose and review the process in this issue
MR to update the handbook: !121910 (merged)
Communicate updated process
- engineering-fyi Slack channel
- Engineering Week in Review document
- escalated customer Slack channel

Prior context

We looked into introducing a Priority 0 earlier, but this was a big change that would require a broad update to our security vulnerability response time. The above is a smaller iteration that focuses on addressing critical customer escalations without disrupting Availability & Vulnerability.

We will continue iterating on implementing a P0 at a later date, but this would require:

An update to the security vulnerability severity SLO; this would be better done after FedRAMP
An update to the availability severity SLO
An improvement in our delivery processes so that we can guarantee a 12-hour deployment time

Edited Apr 24, 2023 by Mek Stittri