Incident Manager Onboarding - Crystal Poole
This issue is for training and onboarding to be a GitLab Incident Manager.
Introduction to Incident Manager
The goal of incident response:
- The goal of the incident response process is to mitigate customer and/or business impact and restore service to its previous condition. We should favor mitigating the impact over understanding the underlying cause.
  - Example: An Incident Manager may decide to initiate a rollback to a known good version, even if the underlying cause of the problem is unknown.
There are some basic principles of incident response you should be aware of:
- Maintain a clear line of command.
- Designate clearly defined roles.
- Keep a working record of debugging and mitigation as you go.
- Declare incidents early and often.
Source: https://sre.google/workbook/incident-response/
What is the role of Incident Manager?
An Incident Manager:
- Commands and coordinates the incident response, delegating roles as needed.
- Communicates effectively.
- Stays in control of the incident response.
- Works with other responders to resolve the incident.
Source: https://sre.google/workbook/incident-response/
What is not part of the Incident Manager role?
- An Incident Manager should not directly engage in debugging, troubleshooting, or creating technical fixes. These activities should be delegated to the EOC and other responding engineers so the Incident Manager can maintain operational awareness, communicate status, and coordinate the response.
See also the description of Roles and Responsibilities in our Incident Management documentation.
What does an Incident Manager do during an incident?
- Don’t Panic - Incident management can sometimes feel stressful. Don’t panic. Assemble the team of people you need to support you, follow the process, and don’t panic.
- Clearly Communicate Current Status - In the early stages of an incident, ask for an update from the EOC and any other engaged engineers every 5-10 mins.
  - Ask the CMOC for an update on new customer reports every 5-10 mins.
  - Screenshot (or ask others to screenshot) charts showing changes in impact.
  - Post a status update summary regularly (every 5-30 mins) in the incident Slack channel. These updates provide critical information that helps team members across the company coordinate our response.
- If You’re Stuck Ask Probing Questions - If the team of responders is stuck and not sure what to do, ask probing questions to help unblock the team’s thinking. Assign people to investigatory tasks as you generate ideas.
  - Some example probing questions:
    - What is the current impact on users? Is the service unavailable, slow, or partially available?
    - Can we roll back to a known good version? Can someone confirm if that is safe? Even if we don’t yet know exactly what’s causing this, could a rollback restore service while we continue investigating?
    - When did the impact start? Can we learn anything from the timing? Is this correlated with a new deploy, or an increase in traffic volume?
    - Does anyone have a theory as to what’s causing this? Let’s brainstorm some possible areas we can investigate.
    - Can the affected service be safely restarted? Can someone confirm if that is safe?
    - Do we need to escalate this and/or pull in more people to help?
- Get out of the way (but still report status) - As Incident Manager you need to interrupt people to get status, assign tasks, and ensure we are making progress towards a resolution. However, it is also important to give the EOC and any other engaged engineers time to work and investigate. If someone should be heads down doing debugging or technical investigation, make sure they have space to work. Set a timer and ask them for a short update at regular intervals.
- Escalate if you are not making progress - If you find that the current group of responders is blocked and not making acceptable progress towards a resolution, it is time to escalate. Page in additional Incident Manager and EOC support and reach out to key individuals that may be able to help. You can also engage with leaders to help coordinate.
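
As an illustration of the status-reporting guidance above, here is a minimal sketch of composing a regular status update summary for the incident Slack channel. It is not part of any GitLab tooling: the `StatusUpdate` fields and the `format_status_update` helper are hypothetical names chosen for this example. The only rule taken from this document is the 5-30 minute update cadence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class StatusUpdate:
    """Hypothetical shape of one status summary; field names are illustrative."""
    impact: str                # what users are currently experiencing
    actions: str               # what responders are doing right now
    next_update_in: timedelta  # when the next summary is due


def format_status_update(update: StatusUpdate, now: datetime) -> str:
    """Render a plain-text summary and enforce the 5-30 minute cadence."""
    minutes = update.next_update_in.total_seconds() / 60
    if not 5 <= minutes <= 30:
        raise ValueError("next update should be 5-30 minutes away")
    next_at = now + update.next_update_in
    return (
        f"Status update ({now:%H:%M} UTC)\n"
        f"Impact: {update.impact}\n"
        f"Current actions: {update.actions}\n"
        f"Next update by: {next_at:%H:%M} UTC"
    )


if __name__ == "__main__":
    example = StatusUpdate(
        impact="Service is slow for some users",
        actions="EOC is evaluating a rollback to a known good version",
        next_update_in=timedelta(minutes=15),
    )
    print(format_status_update(example, datetime.now(timezone.utc)))
```

Stating when the next update is due lets the Incident Manager leave responders alone in between while still keeping the rest of the company informed.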
Additional Learning about Incident Management
Videos
- A good talk about incident response from PagerDuty.
Reading
- Understanding how to work with the CMOC: the support team has a great write-up on the Communications Manager On-Call (CMOC) workflows.
- SRE Shadow blog post, so you get a feel for what working with the EOC is like.
- Google has a few chapters on incident response in their SRE books:
  a. Good thoughts on the life of the engineers on call.
    - Being On call
    - Effective Troubleshooting
  b. Dealing with the incident:
    - Emergency Response
    - Managing Incidents
    - Workbook examples of incident response
    - Incident Review and Learning from Failure
  c. Being Oncall examples
- The Incident Manager Checklist in our runbooks.
- If you have additional questions about the Incident Manager role, incident response or incident review, please join the #imoc_general Slack channel.
Getting going as an Incident Manager
After the reading and video above, you should start shadowing existing Incident Managers.
Things you will need:
- An account on GitLab's PagerDuty account (via Access Request)
- The PagerDuty App on your phone
- Join the following channels on Slack: #incident-management, #production, #feed_alerts-general, #abuse, #dev-escalation
- Make sure you can log in to the dashboards site
- Make sure you can log in to Kibana
- Familiarize yourself with the dev escalation process
- Make sure you can log in to https://ops.gitlab.net/