Incident Manager Onboarding - Chad Woolley
This issue is for training and onboarding to be a GitLab Incident Manager.
IM Onboarder Details
- Onboarder - GitLab username
- Ideal shift: 17:00 UTC - 23:00 UTC (see handbook for details)
- Add IM-OnboardingTraining when you begin working through this issue
Companion learning resource
- We have created an Incident Manager Training LevelUp course as a companion resource to this onboarding issue. We are continually seeking to improve the offering, so please give us feedback here if you see things that could be better.
- There is also an extensive Frequently Asked Questions section in the Incident Manager onboarding handbook page
The goal of incident response:
The goal of the incident response process is to mitigate customer and/or business impact and restore service to its previous condition. We should favor mitigating the impact over understanding the underlying cause.
Example: The Engineer On-Call may decide to initiate a rollback to a known good version, even if the underlying cause of the problem is unknown.
There are some basic principles of incident response you should be aware of:
- Declare incidents early and often.
- Designate clearly defined roles.
- Maintain a clear line of command.
- Keep a working record of debugging and mitigation as you go.
- Facilitate smooth hand-offs between responders coming and going on/off shift.
Source: https://sre.google/workbook/incident-response/
What is the role of Incident Manager?
An Incident Manager:
- Commands and coordinates the incident response, delegating roles as needed.
- Communicates effectively with the team and works with the Communications Manager On-Call (CMOC) for external and executive communication.
- Stays in control of the incident response.
- Facilitates other responders so they can mitigate and resolve the incident.
Source: https://sre.google/workbook/incident-response/
What is not part of the Incident Manager role?
An Incident Manager should not be directly engaging in debugging, troubleshooting, or creating technical fixes. These activities should be delegated to the GitLab Engineer On-Call (EOC) and other responding engineers so the Incident Manager can maintain operational awareness, communicate status, and coordinate the response.
See also the description of Roles and Responsibilities in our Incident Management documentation.
How does an Incident Manager effectively engage with the Engineer On-Call?
During a high-profile and high-impact incident (e.g., severity 1), one of your primary responsibilities as Incident Manager is to help lower the stress levels of the Engineer On-Call.
Some methods that can be employed to accomplish this:
- Act as a servant leader; ask the EOC what they need.
- Example: Do you need me to bring in someone from team X / dev escalations / CMOC?
- Manage the incident room to keep interruptions to a minimum.
- The purpose of the incident Zoom is for coordinating the technical investigation and mitigation.
- Assist the CMOC with drafting status page updates off-call in a Slack thread, to keep the Zoom call focused.
- If Directors or VPs join the incident call, direct their questions to Slack or a separate Zoom call.
- Interrupt the EOC to ask for an update or a clarification when needed.
- When the investigation is ongoing, get out of the way.
- Manage the incident issue.
- During a user-facing incident we may get lots of user reports and other comments on the incident issue.
- When this happens, you will probably want to lock the incident issue to keep the information focused.
- This helps the EOC and everyone working on the incident to understand the current status and ongoing threads of investigation.
- Avoiding audience participation can also help lower the stress of everyone involved.
What does an Incident Manager do during an incident?
- Don’t Panic
- Incident management can sometimes feel stressful. Don’t panic. Assemble the team of people you need to support you, follow the process, and don’t panic.
- Clearly Communicate Current Status
- In the early stages of an incident, ask for an update from the EOC and any other engaged engineers every 20-30 mins.
- Ask the Communications Manager On-Call (CMOC) for an update on new customer reports if we don't yet have clear data.
- Screenshot (or ask others to screenshot) charts showing changes in impact.
- Report a status update summary regularly (every 15-20 min) in the incident Slack channel. These updates provide critical information that helps team members across the company coordinate our response.
- If You’re Stuck, Ask Probing Questions
- If the team of responders is stuck and not sure what to do, you should ask probing questions to help unblock the team’s thinking. Assign people to investigatory tasks as you generate ideas.
- Some example probing questions:
- What is the current impact on users? Is the service unavailable, slow, partially available?
- Can we roll back to a known good version? Can someone confirm if that is safe? Even if we don’t yet know exactly what’s causing this, could a rollback restore service while we continue investigating?
- When did the impact start? Can we learn anything from the timing? Is this correlated with a new deploy, or an increase in traffic volume?
- Does anyone have a theory as to what’s causing this? Let’s brainstorm some possible areas we can investigate.
- Can the affected service be safely restarted? Can someone confirm if that is safe?
- Do we need to escalate this and/or pull in more people to help?
- Get out of the way (but still report status)
- As Incident Manager, you need to interrupt people to get status, assign tasks, and ensure we are making progress towards a resolution. However, it is also important to give the EOC and any other engaged engineers time to work and investigate. If someone should be heads down doing debugging or technical investigation, make sure they have space to work. Set a timer and ask them for a short update at regular intervals.
- Escalate if you are not making progress
- If you find that the current group of responders is blocked and not making acceptable progress towards a resolution, it is time to escalate. Page in additional Incident Manager and Engineering support and reach out to key individuals that may be able to help. You can also engage with leaders to help coordinate.
Additional Learning about Incident Management
Videos
- A good talk about incident response from PagerDuty.
- There are some helpful video resources in the Monitoring handbook page.
Reading
- Understanding how to work with the CMOC: the support team has a great write up on the Communications Manager On-Call (CMOC) workflows.
- The IM Onboarding handbook page
- The SRE Shadow blog post, so you have a feel for what working with the EOC is like.
- Google has a few chapters on Incident response in their SRE books:
- Good thoughts on the life of the engineers on call.
- Dealing with the incident:
- Being Oncall examples
- The Incident Manager Checklist in our runbooks.
- If you have additional questions about the Incident Manager role, incident response or incident review, please join the #imoc_general Slack channel.
Taking on the role of Incident Manager
Checklist
Note: you can submit the Access Request and start shadowing / joining channels while you wait for it to be processed.
- Request the following via an Access Request:
  - Responder account on GitLab's PagerDuty account (https://gitlab.pagerduty.com/escalation_policies#PO2KR8R)
  - Developer role for https://ops.gitlab.net/gitlab-org/quality/
  - Member of the IMOC Google Group: https://groups.google.com/a/gitlab.com/g/imoc
- Install the PagerDuty App on your phone
  - Make sure to turn on the On-Call Boosters setting in the app to get notified when you are added to or removed from a schedule, or changes are made to your shifts.
  - After installing, you can send a test notification through your profile in the PagerDuty web app.
- Update your GitLab.com account profile so that your email address is set under the "Public email" option. This is required so that you can be properly tagged on incidents.
- Join the following channels on Slack: #incident-management, #production, #feed_alerts-general, #abuse, #dev-escalation, #imoc_general
- Make sure you can log in to the dashboards site
- Make sure you can log in to elastic
- Familiarize yourself with the dev escalation process
- Make sure you can log in to https://ops.gitlab.net/ (Sign in with Google)
- Subscribe to the IMOC shared calendar. Compare your plans against the schedule a few weeks in advance.
- Update your notifications for going on-call in PagerDuty > My Profile > Notifications > Before I go on-call. You can get a push notification or email a few days before your shift starts.
- Create at least 4 consecutive days of overrides in the Incident Manager (Shadow) PD schedule during your preferred working hours.
- Shadow the current Incident Managers (see detailed instructions below).
- (Optional) Reverse shadow an Incident Manager (see detailed instructions below).
- Debrief with the Incident Manager after the first incident that you shadowed or reverse-shadowed, and discuss any additional training needs that are identified.
IM's Manager Setup Checklist
- IM's manager reads the handbook page to familiarize themselves with the new duties of their direct report, along with the new time-in-lieu recommendations
Instructions for shadowing an Incident Manager
The shadow process is informal, in order to provide the most flexibility and lessen overhead in managing a formal PagerDuty schedule.
- Join the above-mentioned Slack channels and, during stretches of your workday when you can respond on short notice, turn on notifications for the #incident-management channel. (Remember to turn off the notifications at the end of your workday to avoid being pinged after hours.)
- You can set notifications to only notify on specific keywords, like "Incident Manager (IM)"
- (Optional) Add yourself to the Incident Manager (Shadow) PD schedule.
- Adding yourself to the Shadow Schedule will mean that you will be paged in addition to the Incident Manager.
- We typically have about 4 incidents a month where the IM is paged, so even with a couple weeks of overrides you may not get paged for an incident.
- To create an override, open the Incident Manager (Shadow) PD schedule in the PagerDuty UI, click Schedule an Override, then click Custom duration, select the time zone and the start and end dates and times, and click the Create Override button to save the changes.
- If for some reason you need to remove an override, click the "x" on the override to be removed in the list of Upcoming Overrides on the right side of the screen.
- When you see a sev 1 or 2 incident declared, or you see an incident declared followed by a Woodhouse message with an emoji of a phone screen and "Incident Manager (IM)", join the Slack channel that is shown in the incident creation message.
- If you want to know who the current IM on call is, use the @incident-managers alias in Slack.
- Join the incident Zoom call (link can be found in the description of an incident Slack channel, or the description of the #incident-management channel), and rename your Zoom name to add "IM Shadow". This will allow everyone to easily understand your role without needing to ask.
- Observe the incident, and in particular the role of the Incident Manager. Write down questions that arise to ask the IM later in a debrief.
- Debrief with the IM on the incident some time later to review the incident and ask any questions that arose. This can be done asynchronously via Slack, or by setting up a sync call.
If you don't have previous experience in an IM role, it would be good to shadow at least 2 incidents. If you have worked as an IM at previous companies, you can skip the shadowing portion of this onboarding, or shadow fewer incidents. Please seek approval from the IMOC coordinator and your manager prior to skipping the shadowing process by mentioning them in this issue and citing some of your past IM experience.
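For anyone who prefers to plan shadow overrides before entering them in the PagerDuty UI, the following is a minimal sketch of how the override windows could be computed. It assumes the 17:00 UTC - 23:00 UTC ideal shift from this issue and 4 consecutive days. The start date, the `PXXXXXX` user ID, and the function names are hypothetical placeholders. The payload shape is an assumption modeled on PagerDuty's schedule override REST endpoint and should be checked against the PagerDuty API documentation before use. The sketch makes no network calls.

```python
from datetime import date, datetime, time, timedelta, timezone


def build_override_windows(first_day, days, start_hour, end_hour):
    """Return one (start, end) pair of UTC datetimes per consecutive day."""
    if days < 1:
        raise ValueError("days must be at least 1")
    if not (0 <= start_hour < end_hour <= 23):
        raise ValueError("expected 0 <= start_hour < end_hour <= 23")
    windows = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        start = datetime.combine(day, time(start_hour), tzinfo=timezone.utc)
        end = datetime.combine(day, time(end_hour), tzinfo=timezone.utc)
        windows.append((start, end))
    return windows


def to_override_payload(window, user_id):
    """Build a request body for one override.

    Assumption: this mirrors the body of PagerDuty's
    "create schedule override" endpoint; verify before relying on it.
    """
    start, end = window
    return {
        "override": {
            "start": start.isoformat(),
            "end": end.isoformat(),
            "user": {"id": user_id, "type": "user_reference"},
        }
    }


if __name__ == "__main__":
    # Hypothetical start date and user ID, for illustration only.
    windows = build_override_windows(date(2024, 1, 8), days=4, start_hour=17, end_hour=23)
    for window in windows:
        print(to_override_payload(window, user_id="PXXXXXX"))
```

Working in UTC keeps the windows unambiguous regardless of the time zone selected in the PagerDuty UI.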
Instructions for reverse shadowing an Incident Manager
When you feel ready, coordinate with an IM for a "reverse shadow". Coordinate with an IM in your time zone who has shifts coming up (check the schedule) and schedule an override for an agreed-upon length of time, with the understanding that the original IM will be around for support. This way, you can handle some incidents with a fallback / escalation point should you need help. This is what we do to onboard SREs into the on-call rotation, and it has been very helpful for new people getting used to things.
Keep in mind that it's not uncommon for entire shifts to pass without any engagement for an IM, especially in APAC hours, so while the reverse shadow is strongly recommended, it's not a hard requirement to have been involved in incidents before officially joining the rotation.
Completing this issue
When you are ready to become an Incident Manager:
- In a note on this issue, ask the IMOC coordinator to add you to the PagerDuty IM schedule
  - Example: @jarv I have completed my IMOC onboarding and wish to be added to the PagerDuty IM schedule
- Mention your manager in this issue
- Apply the IM-OnboardingReady label to this issue and close it