Add GitLab.com Emergency Support
Problem Statement
Customers with Premium support on GitLab SaaS don't have access to Emergency Downtime Support.
This inconsistency in our support model leaves several high-priority classes of problems unaddressable at the appropriate SLA.
Specifically:
- bugs or regressions in features that affect only a small number of customers (that is, production issues that aren't really production issues: they're dev-escalation issues)
- self-hosted runner problems (e.g. runners that were previously working suddenly aren't picking up jobs from GitLab.com)
- security incidents that require rapid action
Large customers have no way to get a 30-minute response time, or any response at all on the weekend.
Proposal
Start addressing GitLab SaaS emergencies in the Customer Emergency PagerDuty rotation.
DRI
@lyle will act as DRI for this issue.
Required Resources
Engineers in the Customer Emergencies rotation should have:
- GitLab.com admin access so that they can:
  - Investigate reported issues
  - Reproduce issues
  - Get performance bar metrics for pages
- GitLab.com staging access so that they can:
  - Confirm that hot patches and rollbacks restore previous functionality
- Training in:
  - Pulling logs from Kibana (see the sketch after this list)
  - Understanding and locating errors in Sentry
  - Using GitLab.com admin access carefully
  - GitLab.com production escalation procedures
  - Dev escalation procedures
  - GitLab.com architecture
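To give a flavor of what the Kibana training might cover, here is a minimal sketch of pulling recent error-level logs through the Elasticsearch API that backs a Kibana instance. The endpoint URL, index pattern, and field names are illustrative assumptions only; the actual GitLab.com logging cluster, indices, and authentication will differ.

```python
# Minimal sketch of "pulling logs from Kibana": query the Elasticsearch
# API behind a Kibana instance for recent error-level entries.
# The URL, index pattern, and field names are hypothetical.
import json
import urllib.request

SEARCH_URL = "https://es.example.com/rails-logs-*/_search"  # hypothetical endpoint

query = {
    "size": 20,
    "sort": [{"@timestamp": {"order": "desc"}}],
    "query": {
        "bool": {
            "filter": [
                {"term": {"json.severity": "ERROR"}},           # assumed field name
                {"range": {"@timestamp": {"gte": "now-15m"}}},  # last 15 minutes
            ]
        }
    },
}

req = urllib.request.Request(
    SEARCH_URL,
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    hits = json.load(resp)["hits"]["hits"]

for hit in hits:
    source = hit["_source"]
    print(source.get("@timestamp"), source.get("json", {}).get("message"))
```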
Potential Roadblocks/Things to consider
- Roll-out: opening this to all customers immediately could result in multiple simultaneous emergencies whose resolution lies outside the support team (e.g. a large production outage)
- => We need to understand what issues are commonly opened, and develop an understanding of the handoff procedure into the production team
- => We need to provide adequate training and support based on actual experiences
- We need to understand how the communication manager on call (CMOC), incident manager on call (IMOC), engineer on call (EOC), and Support Engineer in the Customer Emergency rotation on call (SECEROC - just kidding, I have no idea) formally work together during a large-scale operational emergency.
- => For example, CMOC will be communicating with customers through status page updates while the SECEROC might be operating ZD in incident mode to point newly raised emergencies to the status page while helping diagnose the problem.
- Granting wider access to GitLab.com admin may not be looked on favorably by security
- What's the line between .com customer emergency and incident?
- What is a satisfying "emergency" experience when reporting an active incident?
- Is there a minimum number of seats that might receive this level of service?
- Is there additional cost?
Desired Outcome
Engineers in the Customer Emergency rotation are equipped to handle emergencies from any customer.
When a page comes in from a GitLab.com customer for an emergency situation, the engineer on call knows what to do.
- GitLab.com customers can open emergency tickets.
- Engineers know how to handle them.
Related Issues/MRs/Epics/Tickets
TBD - we have to do something first
Proposed Timeline
- Aug 2020:
  - create training materials
  - train engineers in the rotation for September (tracking in #2630 (closed) and %.com Emergency Roll Out in support-training)
  - conduct at least one practice per region
- Sep 2020:
  - Onboard two GitLab.com customers
  - monitor impact and categorize issues that arise
  - take feedback from engineers
  - conduct at least one practice per region
  - train engineers for Oct
- Oct 2020:
  - Onboard an additional number of customers (based on Sept experience)
  - monitor impact and categorize issues that arise
  - take feedback from engineers
  - conduct at least one practice per region
  - train engineers for Nov
- Nov 2020:
  - Onboard a percentage of GitLab.com customers
  - monitor impact and categorize issues that arise
  - take feedback from engineers
  - conduct at least one practice per region
  - train engineers for Dec
- Dec 2020:
  - Onboard an additional percentage of GitLab.com customers
  - monitor impact and categorize issues that arise
  - take feedback from engineers
  - conduct at least one practice per region
  - train engineers for Jan
- Jan 2021 (tentative):
  - All engineers are trained
  - Announce general availability of GitLab.com emergency coverage
FAQ
- Why not a separate rotation?
Adding a separate rotation isn't out of the question, but each additional role adds complexity. We currently have:
- Customer Emergencies
- US Federal Customer Emergencies
- CMOC
They each have their own intricacies, but I don't think that GitLab.com emergencies are different enough from self-managed emergencies to warrant the overhead of an additional rotation.
There are some differences from self-managed emergencies, but in many cases the initial steps will be very similar:
- Understand what the problem is
- Verify the impact of the problem
- Reproduce the problem reliably (sometimes very easy!)
- Seek to find the root cause
- Resolve the problem
For self-managed instances, resolving the problem often falls to the engineer and the customer. On GitLab.com it may still fall to them, but the resolution might equally come through the production and development teams.
- This feels scary, is it scary?
Yes, it will be a little scary at first. I want to move very slowly on this and make sure that every engineer in the rotation is trained and equipped with the knowledge they need. In some ways though, it's slightly less scary: the production and dev teams are direct escalation points and are on-call as well. Contrast that with the self-managed experience on the weekend, which can be a bit hair-raising.
- Does this add more work to being on-call?
Potentially. We'll have to pay attention to the workload that results from this. Initially, there should be minimal impact, as we'll only be adding a couple of customers as we develop training and get comfortable with hand-off procedures. This will be monitored continually.
Likely types of emergency pages
- broken functionality due to a regression being pushed to GitLab.com => reproduce, identify, escalate to have a patch created and deployed.
- broken functionality due to an inconsistency in data unique to the customer, for example: a group name used to be able to have special characters in it, and now something broke because our group name has a special character in it. => reproduce, identify, escalate to have the specific data corrected (and create a bug report so our code is better)
- GitLab.com access or "performance" degradation to the level of unusability. For example: no access in a geographical area, CI jobs aren't being dispatched => This is the hardest class, but will generally be operational emergencies. Success here means making sure it's not actually one of the first two classes before escalating to SRE (see the triage sketch below)
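As an illustration of that "make sure it's not one of the first two classes" step, here is a minimal triage sketch, assuming a reported "CI jobs aren't being dispatched" emergency: before treating it as an operational incident, check via the GitLab REST API whether jobs are genuinely stuck in pending and whether the customer's own runners are simply offline (a support-solvable problem). The token and project ID are placeholders.

```python
# Minimal triage sketch for a "CI jobs aren't being dispatched" page:
# check whether the customer's runners are simply offline before
# escalating to SRE. PRIVATE_TOKEN and PROJECT_ID are placeholders.
import json
import urllib.request

GITLAB = "https://gitlab.com/api/v4"
PRIVATE_TOKEN = "glpat-..."  # placeholder: a token with read_api scope
PROJECT_ID = 12345           # placeholder: the affected project

def get(path):
    req = urllib.request.Request(
        f"{GITLAB}{path}",
        headers={"PRIVATE-TOKEN": PRIVATE_TOKEN},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# 1. Are jobs actually stuck in the pending state?
pending = get(f"/projects/{PROJECT_ID}/jobs?scope[]=pending")
print(f"{len(pending)} pending job(s)")

# 2. Are the project's runners online at all? An offline self-hosted
#    runner is a customer-side problem, not a GitLab.com incident.
runners = get(f"/projects/{PROJECT_ID}/runners")
for r in runners:
    print(f"runner {r['id']} ({r['description']}): status={r['status']}")
```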
Suggested Communication to eligible customers:
We are piloting emergency support for our GitLab.com customers, and you're getting early access :) This means you will now receive 24x7 emergency support!
- The definitions of support impact still apply (https://about.gitlab.com/support/#definitions-of-support-impact)
- Emergency support should only be triggered when your users are affected by GitLab being unavailable or completely unusable
- You still have Priority Support, which has a 4-hour SLA, 24x5, for all other urgent issues.
Note: this is new to us too, so please be kind to the on-call engineer 🙂
- To reach emergency support, send a new ticket via email to: Emergency@email (again, this is only for emergencies when GitLab is unavailable or completely unusable)
- This will ping our on-call support engineer
- The support engineer may send some initial triage questions and information, and if appropriate will send a Zoom link within the support ticket
- If necessary: the support engineer on call will debug the issue live with you on the call
- For non-emergency support, users can log a ticket through our support portal: https://support.gitlab.com/hc/en-us