Add GitLab.com Emergency Support
Problem Statement
Customers with Premium support on GitLab SaaS don't have access to Emergency Downtime Support.
This inconsistency in our support model leaves several high-priority classes of problems unaddressable at the appropriate SLA.
Specifically:
- bugs or regressions in features that affect only a small number of customers (that is, production issues that aren't really production issues: they're dev-escalation issues)
- self-hosted runner problems (e.g. runners that were previously working suddenly aren't picking up jobs from GitLab.com)
- security incidents that require rapid action
Large customers have no way to get a 30-minute response time, or any response at all on the weekend.
Proposal
Start addressing GitLab SaaS emergencies in the Customer Emergency PagerDuty rotation.
DRI
@lyle will act as DRI for this issue.
Required Resources
Engineers in the Customer Emergencies rotation should have:
- GitLab.com admin access so that they can:
  - Investigate reported issues
  - Reproduce issues
  - Get performance bar metrics for pages
- GitLab.com staging access so that they can:
  - Confirm that hot patches and rollbacks restore previous functionality
- Training in:
  - Pulling logs from Kibana (see the sketch after this list)
  - Understanding and locating errors in Sentry
  - Using GitLab.com admin access carefully
  - GitLab.com production escalation procedures
  - Dev escalation procedures
  - GitLab.com architecture
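To give a flavor of what the Kibana training might cover, here is a minimal sketch of pulling recent error-level logs through the Elasticsearch API that backs a Kibana instance. The endpoint URL, index pattern, and field names are illustrative assumptions only; the actual GitLab.com logging cluster, indices, and authentication will differ.

```python
# Minimal sketch of "pulling logs from Kibana": query the Elasticsearch
# API behind a Kibana instance for recent error-level entries.
# The URL, index pattern, and field names are hypothetical.
import json
import urllib.request

SEARCH_URL = "https://es.example.com/rails-logs-*/_search"  # hypothetical endpoint

query = {
    "size": 20,
    "sort": [{"@timestamp": {"order": "desc"}}],
    "query": {
        "bool": {
            "filter": [
                {"term": {"json.severity": "ERROR"}},           # assumed field name
                {"range": {"@timestamp": {"gte": "now-15m"}}},  # last 15 minutes
            ]
        }
    },
}

req = urllib.request.Request(
    SEARCH_URL,
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    hits = json.load(resp)["hits"]["hits"]

for hit in hits:
    source = hit["_source"]
    print(source.get("@timestamp"), source.get("json", {}).get("message"))
```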
Potential Roadblocks/Things to consider
- Roll-out: opening this to all customers immediately could result in multiple simultaneous emergencies whose resolution lies outside the support team (e.g. a large production outage)
- => We need to understand what issues are commonly opened, and develop an understanding of the handoff procedure into the production team
- => We need to provide adequate training and support based on actual experiences
- We need to understand how the communication manager on call (CMOC), incident manager on call (IMOC), engineer on call (EOC), and Support Engineer in the Customer Emergency rotation on call (SECEROC - just kidding, I have no idea) formally work together during a large-scale operational emergency.
- => For example, CMOC will be communicating with customers through status page updates while the SECEROC might be operating ZD in incident mode to point newly raised emergencies to the status page while helping diagnose the problem.
- Granting wider access to GitLab.com admin may not be looked on favorably by security
- What's the line between .com customer emergency and incident?
- What is a satisfying "emergency" experience when reporting an active incident?
- Is there a minimum number of seats that might receive this level of service?
- Is there additional cost?
Desired Outcome
Engineers in the Customer Emergency rotation are equipped to handle emergencies from any customer.
When a page comes in from a GitLab.com customer for an emergency situation, the engineer on call knows what to do.
- GitLab.com customers can open emergency tickets.
- Engineers know how to handle them.
Related Issues/MRs/Epics/Tickets
TBD - we have to do something first
Proposed Timeline
- Aug 2020:
  - create training materials
  - train engineers in the rotation for September (tracking in #2630 (closed) and %.com Emergency Roll Out in support-training)
  - conduct at least one practice per region
- Sep 2020:
  - Onboard two GitLab.com customers
  - monitor impact and categorize issues that arise
  - take feedback from engineers
  - conduct at least one practice per region
  - train engineers for Oct
- Oct 2020:
  - Onboard an additional number of customers (based on Sept experience)
  - monitor impact and categorize issues that arise
  - take feedback from engineers
  - conduct at least one practice per region
  - train engineers for Nov
- Nov 2020:
  - Onboard a percentage of GitLab.com customers
  - monitor impact and categorize issues that arise
  - take feedback from engineers
  - conduct at least one practice per region
  - train engineers for Dec
- Dec 2020:
  - Onboard an additional percentage of GitLab.com customers
  - monitor impact and categorize issues that arise
  - take feedback from engineers
  - conduct at least one practice per region
  - train engineers for Jan
- Jan 2021 (tentative):
  - All engineers are trained
  - Announce general availability of GitLab.com emergency coverage
FAQ
- Why not a separate rotation?
Adding a separate rotation isn't out of the question, but each additional role adds complexity. We currently have:
- Customer Emergencies
- US Federal Customer Emergencies
- CMOC
They each have their own intricacies, but I don't think that GitLab.com emergencies are different enough from self-managed emergencies to warrant the overhead of an additional rotation.
There are some differences from self-managed emergencies, but in many cases the initial steps will be very similar:
- Understand what the problem is
- Verify the impact of the problem
- Reproduce the problem reliably (sometimes very easy!)
- Seek to find the root cause
- Resolve the problem
For self-managed instances, resolving the problem often falls to the engineer and the customer. On GitLab.com it may still fall to them, but the resolution might equally come through the production and development teams.
- This feels scary, is it scary?
Yes, it will be a little scary at first. I want to move very slowly on this and make sure that every engineer in the rotation is trained and equipped with the knowledge they need. In some ways though, it's slightly less scary: the production and dev teams are direct escalation points and are on-call as well. Contrast that with the self-managed experience on the weekend, which can be a bit hair-raising.
- Does this add more work to being on-call?
Potentially. We'll have to pay attention to the workload that results from this. Initially, there should be minimal impact, as we'll only be adding a couple of customers as we develop training and get comfortable with hand-off procedures. This will be monitored continually.
Likely types of emergency pages
- broken functionality due to a regression being pushed to GitLab.com => reproduce, identify, escalate to have a patch created and deployed.
- broken functionality due to an inconsistency in data unique to the customer, for example: a group name used to be able to have special characters in it, and now something broke because our group name has a special character in it. => reproduce, identify, escalate to have the specific data corrected (and create a bug report so our code is better)
- GitLab.com access or "performance" degradation to the level of unusability. For example: no access in a geographical area, CI jobs aren't being dispatched => This is the hardest class, but will generally be operational emergencies. Success here means making sure it's not actually one of the first two classes before escalating to SRE (see the triage sketch below)
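As an illustration of that "make sure it's not one of the first two classes" step, here is a minimal triage sketch, assuming a reported "CI jobs aren't being dispatched" emergency: before treating it as an operational incident, check via the GitLab REST API whether jobs are genuinely stuck in pending and whether the customer's own runners are simply offline (a support-solvable problem). The token and project ID are placeholders.

```python
# Minimal triage sketch for a "CI jobs aren't being dispatched" page:
# check whether the customer's runners are simply offline before
# escalating to SRE. PRIVATE_TOKEN and PROJECT_ID are placeholders.
import json
import urllib.request

GITLAB = "https://gitlab.com/api/v4"
PRIVATE_TOKEN = "glpat-..."  # placeholder: a token with read_api scope
PROJECT_ID = 12345           # placeholder: the affected project

def get(path):
    req = urllib.request.Request(
        f"{GITLAB}{path}",
        headers={"PRIVATE-TOKEN": PRIVATE_TOKEN},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# 1. Are jobs actually stuck in the pending state?
pending = get(f"/projects/{PROJECT_ID}/jobs?scope[]=pending")
print(f"{len(pending)} pending job(s)")

# 2. Are the project's runners online at all? An offline self-hosted
#    runner is a customer-side problem, not a GitLab.com incident.
runners = get(f"/projects/{PROJECT_ID}/runners")
for r in runners:
    print(f"runner {r['id']} ({r['description']}): status={r['status']}")
```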
Suggested Communication to eligible customers:
We are piloting emergency support for our GitLab.com customers, and you're getting early access :) This means you will now receive 24x7 emergency support!
- The definitions of support impact still apply (https://about.gitlab.com/support/#definitions-of-support-impact)
- Emergency support should only be triggered when your users are affected by GitLab being unavailable or completely unusable
- You still have Priority Support, which has a 4-hour SLA, 24x5, for all other urgent issues.
Note: this is new to us too, so please be kind to the on-call engineer 🙂
- To reach emergency support, send a new ticket via email to: Emergency@email (again, this is only for emergencies when GitLab is unavailable or completely unusable)
- This will ping our on-call support engineer
- The support engineer may send some initial triage questions and information, and if appropriate will send a Zoom link within the support ticket
- If necessary: the support engineer on call will debug the issue live with you on the call
- For non-emergency support, users can log a ticket through our support portal: https://support.gitlab.com/hc/en-us