Incident Manager Onboarding - Chad Woolley

This issue is for training and onboarding to be a GitLab Incident Manager.

IM Onboarder Details

Onboarder - GitLab username
Ideal shift: 17:00 UTC - 23:00 UTC (see handbook for shift details)
Add the IM-OnboardingTraining label when you begin working through this issue

Companion learning resource

We have created an Incident Manager Training LevelUp course to be a companion resource to this onboarding issue. We are continually seeking to improve the offering, so please do give us feedback here if you see things that could be better.
There is also an extensive Frequently Asked Questions section in the Incident Manager onboarding handbook page.

The goal of incident response:

The goal of the incident response process is to mitigate customer and/or business impact and restore service to its previous condition. We should favor mitigating the impact over understanding the underlying cause.

Example: The Engineer On-Call may decide to initiate a rollback to a known good version, even if the underlying cause of the problem is unknown.

There are some basic principles of incident response you should be aware of:

Declare incidents early and often.
Designate clearly defined roles.
Maintain a clear line of command.
Keep a working record of debugging and mitigation as you go.
Facilitate smooth hand-offs between responders coming and going on/off shift.

Source: https://sre.google/workbook/incident-response/

What is the role of Incident Manager?

An Incident Manager:

Commands and coordinates the incident response, delegating roles as needed.
Communicates effectively to the team and works with the Communications Manager On-Call (CMOC) for external and executive communication.
Stays in control of the incident response.
Facilitates other responders so they can mitigate and resolve the incident.

Source: https://sre.google/workbook/incident-response/

What is not part of the Incident Manager role?

An Incident Manager should not be directly engaging in debugging, troubleshooting, or creating technical fixes. These activities should be delegated to the GitLab Engineer On-Call (EOC) and other responding engineers so the Incident Manager can maintain operational awareness, communicate status, and coordinate the response.

See also the description of Roles and Responsibilities in our Incident Management documentation.

How does an Incident Manager effectively engage with the Engineer On-Call?

During a high-profile and high-impact incident (e.g., severity 1), one of your primary responsibilities as Incident Manager is to help lower the stress levels of the Engineer On-Call.

Some methods that can be employed to accomplish this:

Act as a servant leader; ask the EOC what they need.
- Example: Do you need me to bring in someone from team X / dev escalations / CMOC?
Manage the incident room to keep interruptions to a minimum.
1. The purpose of the incident Zoom is for coordinating the technical investigation and mitigation.
2. Assist the CMOC with drafting status page updates off-call in a Slack thread, to keep the Zoom call focused.
3. If Directors or VPs join the incident call, direct their questions to Slack or a separate Zoom call.
4. Interrupt the EOC to ask for an update or a clarification when needed.
5. When the investigation is ongoing, get out of the way.
Manage the incident issue.
1. During a user-facing incident we may get lots of user reports and other comments on the incident issue.
2. When this happens, you will probably want to lock the incident issue to keep the information focused.
3. This helps the EOC and everyone working on the incident to understand the current status and ongoing threads of investigation.
4. Avoiding audience participation can also help lower the stress of everyone involved.

What does an Incident Manager do during an incident?

Don’t Panic
- Incident management can sometimes feel stressful. Don’t panic. Assemble the team of people you need to support you, follow the process, and don’t panic.
Clearly Communicate Current Status
1. In the early stages of an incident, ask for an update from the EOC and any other engaged engineers every 20-30 mins.
2. Ask the Communications Manager On-Call (CMOC) for an update on new customer reports if we don't yet have clear data.
3. Screenshot (or ask others to screenshot) charts showing changes in impact.
4. Report a status update summary regularly (every 15-20 min) in the incident Slack channel. These updates provide critical information that helps team members across the company coordinate our response.
If You’re Stuck, Ask Probing Questions
1. If the team of responders is stuck and not sure what to do, you should ask probing questions to help unblock the team’s thinking. Assign people to investigatory tasks as you generate ideas.
2. Some example probing questions:
  1. What is the current impact on users? Is the service unavailable, slow, partially available?
  2. Can we roll back to a known good version? Can someone confirm if that is safe? Even if we don’t yet know exactly what’s causing this, could a rollback restore service while we continue investigating?
  3. When did the impact start? Can we learn anything from the timing? Is this correlated with a new deploy, or an increase in traffic volume?
  4. Does anyone have a theory as to what’s causing this? Let’s brainstorm some possible areas we can investigate.
  5. Can the affected service be safely restarted? Can someone confirm if that is safe?
  6. Do we need to escalate this and/or pull in more people to help?
Get out of the way (but still report status)
1. As Incident Manager, you need to interrupt people to get status, assign tasks, and ensure we are making progress towards a resolution. However, it is also important to give the EOC and any other engaged engineers time to work and investigate. If someone should be heads down doing debugging or technical investigation, make sure they have space to work. Set a timer and ask them for a short update at regular intervals.
Escalate if you are not making progress
1. If you find that the current group of responders is blocked and not making acceptable progress towards a resolution, it is time to escalate. Page in additional Incident Manager and Engineering support and reach out to key individuals that may be able to help. You can also engage with leaders to help coordinate.
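
The update cadences described above (asking engaged engineers for an update every 20-30 minutes and posting a summary every 15-20 minutes) can be turned into a simple reminder schedule. The following is a minimal sketch, not part of any GitLab tooling: the function name `reminder_times`, the specific intervals chosen from within those ranges, and the example start time are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def reminder_times(start, interval_minutes, count):
    """Return `count` reminder timestamps spaced `interval_minutes` apart after `start`."""
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]


# Hypothetical incident start time, used only for illustration.
incident_start = datetime(2024, 5, 6, 17, 0, tzinfo=timezone.utc)

# Assumed cadences, picked from within the ranges given in the text.
summary_posts = reminder_times(incident_start, 15, 4)  # post a Slack summary
eoc_checkins = reminder_times(incident_start, 30, 2)   # ask the EOC for an update

for t in summary_posts:
    print(t.strftime("%H:%M UTC"), "post status summary in the incident Slack channel")
for t in eoc_checkins:
    print(t.strftime("%H:%M UTC"), "ask the EOC for a short update")
```

Any timer or calendar reminder serves the same purpose; the point is to fix the cadence up front so that engineers get uninterrupted stretches between check-ins.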

Additional Learning about Incident Management

Videos

A good talk about incident response from PagerDuty.
The Monitoring handbook page also has some video resources that can be helpful.

Reading

Understanding how to work with the CMOC: the support team has a great write-up on the Communications Manager On-Call (CMOC) workflows.
The IM Onboarding handbook page
The SRE Shadow blog post, to give you a feel for what working with the EOC is like.
Google has a few chapters on incident response in their SRE books:
1. Good thoughts on the life of the engineers on call.
  - Being On-Call
  - Effective Troubleshooting
2. Dealing with the incident:
3. Being Oncall examples
The Incident Manager Checklist in our runbooks.
If you have additional questions about the Incident Manager role, incident response or incident review, please join the #imoc_general Slack channel.

Taking on the role of Incident Manager

Checklist

Note: you can submit the Access Request, then start shadowing and joining channels while you wait for it to be approved.

IM's Manager Setup Checklist

The IM's manager reads the handbook page to become familiar with their direct report's new duties and with the new time-in-lieu recommendations.

Instructions for shadowing an Incident Manager

The shadow process is informal, in order to provide the most flexibility and lessen overhead in managing a formal PagerDuty schedule.

Join the above-mentioned Slack channels and, during stretches of your workday when you can respond on short notice, turn on notifications for the #incident-management channel. (Remember to turn off the notifications at the end of your workday to avoid being pinged after hours.)
1. You can set notifications to only notify on specific keywords, like "Incident Manager (IM)"
(Optional) Add yourself to the Incident Manager (Shadow) PD schedule.
- Adding yourself to the Shadow Schedule will mean that you will be paged in addition to the Incident Manager.
- We typically have about 4 incidents a month where the IM is paged, so even with a couple weeks of overrides you may not get paged for an incident.
- To create an override: in the PagerDuty UI, open the Incident Manager (Shadow) PD schedule, click Schedule an Override, click Custom duration, select the time zone and the start and end dates and times, then click the Create Override button to save the changes.
- If for some reason you need to remove an override, click the "x" on the override to be removed in the list of Upcoming Overrides on the right side of the screen.
When you see a sev 1 or 2 incident declared, or you see an incident declared followed by a Woodhouse message with an emoji of a phone screen and "Incident Manager (IM)", join the Slack channel that is shown in the incident creation message.
If you want to know who the current IM on call is, use the @incident-managers alias in Slack.
Join the incident Zoom call (the link can be found in the description of an incident Slack channel, or the description of the #incident-management channel), and change your Zoom display name to include "IM Shadow". This will allow everyone to easily understand your role without needing to ask.
Observe the incident, and in particular the role of the Incident Manager. Write down questions that arise to ask the IM later in a debrief.
Debrief with the IM on the incident some time later to review the incident and ask any questions that arose. This can be done asynchronously via Slack, or by setting up a sync call.
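
The remark above that a couple of weeks of overrides may pass without a page can be illustrated with a rough estimate. This sketch assumes, purely for illustration, that IM pages arrive independently and evenly around the clock (a Poisson model), that a month has 30 days, and that the shadow override covers a hypothetical six-hour window on each of 14 days. Only the rate of about 4 pages a month comes from the text; real incidents are unlikely to be spread this evenly.

```python
import math

PAGES_PER_MONTH = 4        # "about 4 incidents a month" from the text
DAYS_PER_MONTH = 30        # assumption
OVERRIDE_DAYS = 14         # assumption: "a couple weeks" of overrides
HOURS_PER_DAY_COVERED = 6  # assumption: hypothetical daily override window


def expected_pages(pages_per_month, days, hours_per_day):
    """Expected pages during covered hours, assuming pages are spread evenly over time."""
    covered_fraction = (days / DAYS_PER_MONTH) * (hours_per_day / 24)
    return pages_per_month * covered_fraction


def probability_of_no_page(expected):
    """Poisson probability of zero events given the expected count."""
    return math.exp(-expected)


expected = expected_pages(PAGES_PER_MONTH, OVERRIDE_DAYS, HOURS_PER_DAY_COVERED)
print(f"expected pages: {expected:.2f}")
print(f"chance of no page: {probability_of_no_page(expected):.0%}")
```

Under these assumptions the expected number of pages is below one (roughly 0.47), so a shadow period with no page at all is more likely than not. Longer daily windows or more override days raise the odds of seeing an incident.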

If you don't have previous experience in an IM role, it would be good to shadow at least 2 incidents. If you have worked as an IM at previous companies, you can skip the shadowing portion of this onboarding, or shadow fewer incidents. Before skipping the shadowing process, please seek approval from the IMOC coordinator and your manager by mentioning them in this issue and citing some of your past IM experience.

Instructions for reverse shadowing an Incident Manager

When you feel ready, coordinate with an IM for a "reverse shadow". Find an IM in your time zone who has shifts coming up (check the schedule) and schedule an override for an agreed-upon length of time, with the understanding that the original IM will be around for support. This way, you can handle some incidents with a fallback / escalation point should you need help. This is what we do to onboard SREs in the on-call rotation, and it has been very helpful for new people getting used to things.

Keep in mind that it's not uncommon for entire shifts to pass without any engagement for an IM, especially in APAC hours, so while the reverse shadow is strongly recommended, it's not a hard requirement to have been involved in incidents before officially joining the rotation.

Completing this issue

When you are ready to become an Incident Manager:

In a note on this issue, ask the IMOC coordinator to add you to the PagerDuty IM schedule:

@jarv I have completed my IMOC onboarding and wish to be added to the PagerDuty IM schedule

Mention your manager in this issue
Apply the IM-OnboardingReady label to this issue and close it

Edited May 06, 2024 by John Jarvis