(RFC) Dealing with concurrent emergencies over the weekend in APAC

Need

We are beginning to see an increase in emergencies over the weekend, and it's becoming more likely that a second emergency will come in while the on call engineer is still engaged on the first one (concurrent emergencies).

In these situations, I've observed that the Support Manager on call has to jump in on the second emergency and do their best to help that customer, either until the on call engineer is free or by resolving the emergency on their own. While Support Managers are capable of performing this role, it isn't necessarily part of their core responsibilities, and while they are engaged on an emergency they are unable to fulfil any STARs that may come through. Essentially, Support Managers are starting to become an "unofficial" backup for on call engineers in these situations.

Sometimes Support Engineers who aren't on call notice in Slack that concurrent emergencies have come through during the weekend, and they then volunteer to take one of them. We don't expect Support Engineers to be on Slack during the weekends - so how do we easily reach Support Engineers that may want to help when concurrent emergencies come through?

This RFC proposes that we introduce a proper process for locating another Support Engineer to assist when concurrent emergencies come through over the weekend.

Additionally - this isn't really a problem during GitLab Support hours (weekdays), as we normally have other Support Engineers online and it's as simple as asking in Slack if anyone can take the other emergency. On the weekends, it's much less likely that Support Engineers will be online, which is why I think we need a better process in place.

Approach

We currently have this escalation policy in APAC for PagerDuty:

On call engineer -> Support manager on call -> Directors

As a trial, I would propose we introduce a new on call pool in APAC:

Pool 1: On call engineer -> Support Manager on call -> Directors
Pool 2: Backup engineers

The backup engineers would be a group of Support Engineers that volunteer to take part. The idea here is that:

If an emergency comes through, the on call engineer would handle it as per the current process in pool 1.
If another emergency comes through while they are still on the first emergency, the on call engineer should escalate the page, instead of acknowledging/resolving it. This will ping the Support Manager on call.
The Support Manager on call then checks the current situation and determines if the backup engineers need to be pinged. If so, the Support Manager will then manually page the backup engineers on pool 2.
At this point, the backup engineers are all pinged. Ideally only one backup engineer needs to acknowledge the page and lend assistance.
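The steps above can be sketched as a small decision function. This is only an illustration of the proposed flow, not PagerDuty configuration: the names `Action`, `OnCallState` and `handle_emergency` are hypothetical, and the Support Manager's judgement call is reduced to a single boolean.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    ENGINEER_HANDLES = auto()     # on call engineer takes the emergency (pool 1)
    ESCALATE_TO_MANAGER = auto()  # on call engineer escalates instead of acknowledging
    PAGE_BACKUP_POOL = auto()     # Support Manager manually pages pool 2


@dataclass
class OnCallState:
    # True while the on call engineer is engaged on an emergency.
    engineer_busy: bool = False


def handle_emergency(state: OnCallState, manager_wants_backup: bool) -> list:
    """Return the actions taken for one incoming emergency (hypothetical model)."""
    if not state.engineer_busy:
        # First emergency: handled as per the current process.
        state.engineer_busy = True
        return [Action.ENGINEER_HANDLES]
    # Concurrent emergency: the page is escalated to the Support Manager on call.
    actions = [Action.ESCALATE_TO_MANAGER]
    if manager_wants_backup:
        # The backup pool is never paged automatically, only by the manager.
        actions.append(Action.PAGE_BACKUP_POOL)
    return actions
```

Note that the backup pool is only ever reached through the manager's decision, which matches the intent that pings to pool 2 are a last resort.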

The backup engineers are:

Anyone that volunteers to be part of pool 2
There is no expectation that tenure or seniority should dictate if you should be part of pool 2
There is no expectation that any backup engineer has to acknowledge and resolve a page. Ideally in this situation, we would love it if one backup engineer acknowledges the page.
This is not intended to be a tiered support approach - rather it is a backup mechanism for when the on call engineer is engaged on another emergency during the weekends. In practice, we expect pings to the backup engineers to be a last resort.
Never automatically pinged - the ping is manually initiated by a Support Manager.
It should be easy to add and remove yourself from the backup engineers pool. This is especially important if you are about to go on PTO.
We should also respect the APAC 1 and APAC 2 groups. Therefore, if you volunteer to be a backup engineer and you are in APAC 1, you agree that you might be pinged on the weekend during the APAC 1 shift when you are not on call.
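To make the membership rules above concrete, here is a rough sketch of the pool as a data structure. It assumes each volunteer is tagged with their group, and that only volunteers in the group whose shift is currently running get pinged. `BackupPool`, its methods and the placeholder engineer names are all hypothetical and do not correspond to any PagerDuty feature.

```python
from dataclasses import dataclass, field

GROUPS = ("APAC 1", "APAC 2")


@dataclass
class BackupPool:
    # Maps a volunteer's name to their group ("APAC 1" or "APAC 2").
    members: dict = field(default_factory=dict)

    def join(self, name: str, group: str) -> None:
        """Volunteer for the pool; tenure and seniority are not considered."""
        if group not in GROUPS:
            raise ValueError(f"unknown group: {group}")
        self.members[name] = group

    def leave(self, name: str) -> None:
        """Remove yourself, e.g. before PTO. Leaving twice is harmless."""
        self.members.pop(name, None)

    def to_page(self, current_shift: str) -> list:
        """Volunteers to ping for a manual page during the given shift."""
        return sorted(n for n, g in self.members.items() if g == current_shift)
```

For example, with `engineer_a` and `engineer_c` in APAC 1 and `engineer_b` in APAC 2, a manual page during the APAC 1 shift would reach only `engineer_a` and `engineer_c`.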

Benefit

The on call engineer no longer needs to worry about finding someone to take the other emergency. They can simply escalate the page to the Support Manager on call and let them find another engineer to assist.
If multiple emergencies come through on the weekend, we can easily get other Support Engineers to pop in and assist.
There is less stress on both the on call engineer and the Support Manager on call, as we are spreading the load out more effectively.
This avoids introducing a second roster for on call.
Provided that a backup engineer responds to a ping, a Support Manager on call no longer needs to lead an emergency call if a concurrent emergency comes through during the weekend.

Competition / Alternatives

Do nothing
Create a second roster for on call. I don't think we want to do this yet as I feel the team needs to get larger first - and I doubt everyone wants to be on call more frequently at the moment.

What I need from you

Ultimately the goal here is to gauge interest in this. If it is implemented, I propose that we only roll it out for APAC as a trial. If it is successful, AMER/EMEA can adopt it if they want it too.

Please react with a 👍 if you like the idea, or a 👎 if you don't like it.

If you would be interested in being a backup engineer, please react with a 🤚 emoji - this is to see if we have the numbers.

And I look forward to reading any comments/thoughts/concerns you might have on this!

Requirements for Support Ops

Add a new escalation policy with the SEs that reacted with a 🤚 on this issue. As of 2022/12/15 the SEs to include are:

Anton Smith
Justin Farmiloe
Kenneth Chu

Only one layer is needed. If no one responds to the ping, that is fine.

Can we go with Customer Emergencies - APAC Backup Pool for a policy name? 😄

Edited Dec 15, 2022 by Anton Smith