@@ -94,7 +94,7 @@ These principles help maintain efficiency while ensuring every emergency has cle
- 🎫 Maintain your regular workload during the week prior.
- 📅 Toward the end of the week (Thursday-Friday), look through your queue:
- Identify the tickets that will need to be [handed over](https://gitlab.com/gitlab-com/support/support-team-meta/-/issues/6371) (i.e. High priority tickets, high touch tickets, STAR’ed or escalated customers)
- Identify the tickets that will need to be [handed over](https://gitlab.com/gitlab-com/support/support-team-meta/-/issues/6371) (i.e. High priority tickets, high touch tickets, STAR'ed or escalated customers)
- Leave the summary you would want to receive
- Work with your network/peers/Support Pod to find an Assignee for each of those tickets
- During the week before you are on-call, discuss tickets that need to be handed over with your manager. Assign these tickets to them to ensure they have a DRI and chat through Next Steps as needed. (It's expected that your Manager will help with finding an Assignee to work on the ticket.)
@@ -223,7 +223,7 @@ As the CEOC you will work with the customer along with other Support Engineers t
1. If the emergency was raised due to a GitLab.com Incident, follow [customer emergencies are triggered by a GitLab incident](#customer-emergencies-are-triggered-by-a-gitlab-incident).
1. Monitor the number of Support Engineers actively participating in the emergency response. If multiple Support Engineers join the call to assist, assess whether all are actively contributing to the resolution. As the DRI, don't hesitate to ask colleagues to step back from active participation if their input isn't currently needed - they can monitor passively and re-engage if circumstances change. Note that Support Engineers [shadowing customer emergencies](#customer-emergency-shadow-pagerduty-schedule) should continue observing as part of their learning process.
**NOTE:** If you need to reach the current on-call engineer and they're not accessible on Slack (e.g., it's a weekend, or the end of a shift), you can [manually trigger a PagerDuty incident](https://support.pagerduty.com/main/docs/incidents#trigger-an-incident) to get their attention, selecting **Customer Support** as the Impacted Service and assigning it to the relevant Support Engineer.
**NOTE:** If you need to reach the current on-call engineer and they're not accessible on Slack (for example, it's a weekend, or the end of a shift), you can [manually trigger a PagerDuty incident](https://support.pagerduty.com/main/docs/incidents#trigger-an-incident) to get their attention, selecting **Customer Support** as the Impacted Service and assigning it to the relevant Support Engineer.
### Stage 4: Resolve
@@ -312,9 +312,9 @@ When an emergency request ticket does not contain information sufficient to
allow you to determine the appropriate path forward, send the
customer a message through the ticket:
1. explaining that in order to correctly categorise the situation, you would
1. explaining that in order to correctly categorize the situation, you would
like to understand more about the effect it is having on their ability to
work or to meet their business objectives (*i.e.* business impact)
work or to meet their business objectives (*that is*, business impact)
1. asking for the specific additional context that you require in order to
understand what problem they are facing and what help they need
@@ -365,14 +365,14 @@ The important details to include in the message are:
##### (Optional) Contact the on-call Support Manager
If at any point you would like advice or help finding additional support, [contact the on-call Support Manager](/handbook/support/on-call/#engaging-the-on-call-manager). The on-call manager is there to support you. They can locate additional Support Engineers if needed. This can make it easier to handle a complex emergency by having more than one person on the call, so you can share responsibilities (e.g., one person takes notes in Slack while the other communicates verbally on the call). Managers are on-call during weekends, so you can page for help at any time.
If at any point you would like advice or help finding additional support, [contact the on-call Support Manager](/handbook/support/on-call/#engaging-the-on-call-manager). The on-call manager is there to support you. They can locate additional Support Engineers if needed. This can make it easier to handle a complex emergency by having more than one person on the call, so you can share responsibilities (for example, one person takes notes in Slack while the other communicates verbally on the call). Managers are on-call during weekends, so you can page for help at any time.
#### Handling multiple simultaneous emergencies
In rare cases, the on-call engineer may experience concurrent emergencies triggered by separate customers. If this happens to you, please remember that you are not alone; you need only take the first step in the following process to ensure proper engagement and resolution of each emergency:
1. **You**: [Contact the on-call Support Manager](/handbook/support/on-call/#engaging-the-on-call-manager) to inform them of the new incoming emergency. The Support Manager is responsible for finding an engineer to own the new emergency page.
1. **Support Manager**: In Slack, ping the regional support group (*e.g.* `@support-team-americas`) and request assistance from anyone who is available to assist with the new incoming emergency case.
1. **Support Manager**: In Slack, ping the regional support group (*for example* `@support-team-americas`) and request assistance from anyone who is available to assist with the new incoming emergency case.
1. **Second Support Engineer**: Acknowledge and resolve the emergency page to indicate that you are assisting the customer with the case.
#### Customer emergencies are triggered by a GitLab incident
@@ -441,7 +441,7 @@ Before ending an emergency customer call, let the customer know what to do if th
For example:
> It seems like we've solved the root problem here, but if you need any help I'll be on-call for the next two hours. Feel free to **open a new emergency ticket** and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that she has the background of the situation before I leave for the day.
> It seems like we've solved the root problem here, but if you need any help I'll be on-call for the next two hours. Feel free to **open a new emergency ticket** and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that they have the background of the situation before I leave for the day.
When the call has ended:
@@ -463,7 +463,7 @@ Situations may arise where a customer emergency has not been resolved, but they
For example:
> We were not able to get to a resolution today and I understand you will be away until tomorrow morning. If you come back to this and need any help, I'll be on-call for the next two hours. Feel free to **open a new emergency ticket** and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that she has the background of the situation before I leave for the day.
> We were not able to get to a resolution today and I understand you will be away until tomorrow morning. If you come back to this and need any help, I'll be on-call for the next two hours. Feel free to **open a new emergency ticket** and I'll get back on a call with you right away. If it's after two hours, my colleague Francesca will be responding. I'll make sure that they have the background of the situation before I leave for the day.
When the call has ended:
@@ -539,21 +539,19 @@ even with specialized expertise, it may take time for the on-call engineer to
get up to speed on the specific aspect of GitLab that is the focus of the
emergency.
To escalate to a subject matter expert, refer to
To escalate to a Subject Matter Expert (SME), refer to
the [Tier 2 On-Call Program](/handbook/engineering/infrastructure-platforms/incident-management/on-call/tier-2/)
for the appropriate team and escalation criteria.
for the appropriate team and [escalation criteria](/handbook/engineering/infrastructure-platforms/incident-management/on-call/tier-2/#escalation-criteria).
You can also refer to the [Engineering Directory](#engineering-directory) to help you identify the relevant teams responsible for the feature or subject area you need help on.
When initiating developer escalations ([including Gitaly EOCs](#for-gitaly-specific-emergencies)), monitor whether their continued active participation remains necessary as the situation evolves. If the issue moves away from their area of expertise, proactively ask if they can disengage from active participation while remaining available for re-engagement if needed. This helps maintain focus and reduces noise for both the customer and Support.
Most Tier 2 SME rotations provide 24x5 coverage (Monday-Friday). Weekend escalations may have limited availability. Always check the specific team's [coverage](/handbook/engineering/infrastructure-platforms/incident-management/on-call/tier-2/#active-tier-2-rotations) details before escalating.
#### For Gitaly specific emergencies
When escalating to an SME, monitor whether their continued active participation remains necessary as the situation evolves. If the issue moves away from their area of expertise, proactively ask if they can disengage from active participation while remaining available for re-engagement if needed. This helps maintain focus and reduces noise for both the customer and Support.
If the emergency requires specialized Gitaly/Gitaly Cluster expertise, you can escalate directly to the Gitaly EOC. See the [Gitaly on-call rotation](../../engineering/infrastructure-platforms/tenant-scale/gitaly/_index.md#on-call-rotation) for the escalation process and coverage details.
You can also refer to the [Engineering Directory](#engineering-directory) to help you identify the relevant teams responsible for the feature or subject area you need help on.
## Other forms of Emergencies
Customer Support provides 24/7 coverage for customers subscribed to GitLab’s Advanced and Signature Success Tiers. These premium tiers, which include access to a [Customer Success Architect (CSA)](/handbook/customer-success/csm/segment/csa/), require continuous support and faster response times for Severity 2 issues (labeled as High Priority tickets in Zendesk).
Customer Support provides 24/7 coverage for customers subscribed to GitLab's Advanced and Signature Success Tiers. These premium tiers, which include access to a [Customer Success Architect (CSA)](/handbook/customer-success/csm/segment/csa/), require continuous support and faster response times for Severity 2 issues (labeled as High Priority tickets in Zendesk).
@@ -692,7 +690,7 @@ We're expecting, broadly that emergencies will fall into one of five categories:
- Success may mean: reproducing, identifying or creating a bug report and escalating to have a patch created and deployed.
- **broken functionality due to an inconsistency in data unique to the customer**, for example: a group name used to be able to have special characters in it, and now something broke because our group name has a special character in it.
- Success may mean reproducing the error, identifying it in Sentry/Kibana, escalating to have the specific data corrected (and creating a bug report so our code is better)
- **GitLab.com access or "performance" degradation to the level of unusability**, for example: no access in a geographical area, CI jobs aren't being dispatched. This is the hardest class, but will generally be operational emergencies.
- **GitLab.com access or "performance" degradation to the level of being unusable**, for example: no access in a geographical area, CI jobs aren't being dispatched. This is the hardest class, but will generally be operational emergencies.
- Success here means making sure it's not actually one of the top two before [declaring an incident](/handbook/engineering/infrastructure-platforms/incident-management/#report-an-incident-via-slack) and letting the SRE team diagnose and correct the root cause.
- **License / Consumption issues are preventing access to the product**
@@ -702,7 +700,7 @@ We're expecting, broadly that emergencies will fall into one of five categories:
#### Broken Functionality
If a customer is reporting that behaviour has recently changed, first check [GitLab.com Status](https://status.gitlab.com) and `#incidents` for any on-going incidents. If there's no known incident:
If a customer is reporting that behavior has recently changed, first check [GitLab.com Status](https://status.gitlab.com) and `#incidents` for any ongoing incidents. If there's no known incident:
1. Initiate a call with the customer. You're specifically looking to:
- observe broken behavior.
@@ -729,7 +727,7 @@ If there is a known incident, it's acceptable to link to the public status page
##### Example tickets
- [Feature flag broke previously working behaviour](https://gitlab.zendesk.com/agent/tickets/204073): resolution was to turn off a feature-flag.
- [Feature flag broke previously working behavior](https://gitlab.zendesk.com/agent/tickets/204073): resolution was to turn off a feature-flag.
- [Regression on GitLab.com broke previously working pipeline](https://gitlab.zendesk.com/agent/tickets/147266): resolution was to revert a recently deployed MR.
- [Customer locked themselves out of their group by changing SAML settings](https://gitlab.zendesk.com/agent/tickets/146611)
@@ -806,17 +804,17 @@ US Government on-call support is provided 7 days a week between the hours of 050
The current on-call schedule can be viewed in [PagerDuty](https://gitlab.pagerduty.com/schedules#P89ZYHZ) (Internal Link). The schedule is currently split into three 8-hour shifts, which roughly correspond to the dayshift, evening, and overnight team member hours:
- Dayshift: 05:00 - 13:00 PT
- Day-shift: 05:00 - 13:00 PT
- Evenings: 13:00 - 21:00 PT
- Overnight: 21:00 - 05:00 PT
Customers are permitted to submit emergencies via email or via the emergency form in the US Government support portal.
Customers are permitted to submit emergencies using email or through the emergency form in the US Government support portal.
#### On-call Shift Coverage in US Government
In the event that a Support Engineer needs coverage for a scheduled On-call shift, open an issue in Support Team Meta using the `us-gov-oncall-coverage` template.
Dayshift engineers needing coverage on a **non-holiday weekday** may give the shift to the Support Bot. To do so, open an issue in Support Team Meta using the `us-gov-oncall-coverage` template and mention your manager for review. After ensuring that the shift(s) in question do not fall on a weekend or holiday, remove the override for your shift in PagerDuty and ensure it falls back to the bot user.
Day-shift engineers needing coverage on a **non-holiday weekday** may give the shift to the Support Bot. To do so, open an issue in Support Team Meta using the `us-gov-oncall-coverage` template and mention your manager for review. After ensuring that the shift(s) in question do not fall on a weekend or holiday, remove the override for your shift in PagerDuty and ensure it falls back to the bot user.
#### Emergencies outside on-call hours
@@ -848,7 +846,7 @@ As appropriate, you can use the section on [escalating emergency issues](/handbo
### Supporting 24/7 Coverage for Customers on the Advanced or Signature Success Tier - Phase 1
Customer Support provides 24/7 coverage for customers subscribed to GitLab’s Advanced and Signature Success Tiers. These premium tiers, which include access to a [Customer Success Architect (CSA)](/handbook/customer-success/csm/segment/csa/), require continuous support and faster response times for Severity 2 issues (labeled as High Priority tickets in Zendesk).
Customer Support provides 24/7 coverage for customers subscribed to GitLab's Advanced and Signature Success Tiers. These premium tiers, which include access to a [Customer Success Architect (CSA)](/handbook/customer-success/csm/segment/csa/), require continuous support and faster response times for Severity 2 issues (labeled as High Priority tickets in Zendesk).
@@ -876,7 +874,7 @@ We as a company want to treat High Priority tickets, especially from customers w
In cases where:
- An [assigned support engineer](/handbook/support/enhanced-support-offerings/offering-assigned-support-engineer/) has opted to automatically assign their customer's tickets to themselves; and
- The customer creates a Sev 2 ticket on a weekend.
- The customer creates a "Sev 2" (Severity 2) ticket on a weekend.