Commit 329206c6 authored by Kam Kyrala

Consolidate incident management handbook pages

parent 04477d83
+1 −1
@@ -187,7 +187,7 @@
/content/handbook/engineering/infrastructure/team/cloud-connector/ @pjphillips
/content/handbook/engineering/infrastructure-platforms/gitlab-delivery/distribution/ @marin @denisra @mbursi
/content/handbook/engineering/infrastructure-platforms/emergency-change-processes.md @marin @bill_staples
-/content/handbook/engineering/infrastructure-platforms/incident-management/ @dawsmith @rnienaber @marin @sabrams @jtoto-gtl
+/content/handbook/engineering/infrastructure-platforms/incident-management/ @dawsmith @rnienaber @marin @sabrams @jtoto-gtl @jscarborough @fviegas
/content/handbook/engineering/infrastructure-platforms/service-maturity-model.md @rnienaber @marin @bill_staples
/content/handbook/engineering/infrastructure/team/ @marin
/content/handbook/engineering/infrastructure-platforms/andrew-newdigate.md @andrewn
+3 −65
---
-title: "Incident"
+title: "Incident Management"
+redirect_to: /handbook/engineering/infrastructure-platforms/incident-management/
---

## Definition of an Incident

The definition of "incident" can vary widely among companies and industries. Here at GitLab, incidents are **anomalous conditions** that result in — or may lead to — service degradation, outages, or other disruptions. These events require human intervention to avert disruptions, communicate status, restore normal service, and identify future improvements.
Incidents are _always_ given immediate attention.

## Incident Management

Incident Management is the process of responding to, mitigating, and documenting an incident. At GitLab, we approach Incident Management as a feedback loop with the following steps, which different teams adjust as needed:

### 1. Preparation

The first step in an effective Incident Management program is preparation. This includes documentation of process and relevant training for everyone who could be involved in an incident. This step also includes ensuring the appropriate monitoring and alerting is in place, and the right people are part of the on-call rotation. The people involved in the preparation phase do not necessarily align with the roles for steps 2 - 7 defined below.

### 2. Identification

The various paths of identifying a problem include:

- Instrumentation/alerting/monitoring
- Customer reports
- Team member reports
- Security reports or other threat intelligence

Once a problem has been identified by one or more of the above paths, an incident is declared.

### 3. Investigation

Investigation includes looking for the cause of an outage/service disruption and an initial determination of the impact of the incident, which informs the severity level the incident is declared at. Severity levels may be changed as impact is further revealed in later stages.

### 4. Containment

Containing the impact and stabilizing the service as quickly as possible. Once containment is achieved and the impact of the disruption is alleviated, the incident is considered "mitigated".

### 5. Remediation

A more robust response to stabilizing the service. An incident is considered remediated, or "resolved", when all anomalous conditions are resolved.

### 6. Recovery

Improvements are made across testing and documentation, based on the specific containment and remediation actions taken for the incident. Corrective Actions identified to prevent the incident from recurring, or to significantly improve future Time to Detection, Time to Mitigation, or Time to Resolution, may be started in this phase.

### 7. Learnings

This includes doing Root Cause Analysis, Incident Reviews/retrospectives, and identifying and implementing further Corrective Actions.
All of these should feed into updating documentation and training in step 1 and close the feedback loop.

Learning from incidents individually is important. Incidents should also be reviewed holistically to identify trends and learnings in order to improve the organization's posture, processes, and product.

## Incident roles

|  Role  |  Responsibilities |
| ------ | ----------------- |
| Incident Manager (IM) | Coordination of efforts across roles and teams. |
| Engineer On Call (EOC) | Responding to pages and conducting initial triage and investigation. |
| Communications Manager On Call (CMOC) | External communications through various channels. |

## On-call Schedule Management

For most on-call schedule management, GitLab uses [Incident.io](https://app.incident.io/gitlab/dashboard) to create schedules and set escalation policies.

We also employ a [Development Escalation Process](/handbook/engineering/development/processes/infra-dev-escalation/process/) to get expertise from development teams as needed.

## How we monitor and alert GitLab

[Here](/handbook/engineering/monitoring/) is an overview of our monitoring. We use an in-house tool to alert when a service is in breach of its SLI or SLO, which also connects to incident.io.
This page has moved to [Incident Management](/handbook/engineering/infrastructure-platforms/incident-management/).
+17 −1
---
title: Incident Management
aliases:
  - /handbook/engineering/incident-management/
---

{{% alert color="warning" %}}
@@ -39,6 +41,20 @@ containing a link to a per-incident Slack channel for text based communication.
Within the incident channel, a per-incident Zoom link will be created.
Additionally, a GitLab issue will be opened in the [Production tracker](https://gitlab.com/gitlab-com/gl-infra/production)

### Incident Management Lifecycle

At GitLab, we approach Incident Management as a feedback loop with the following steps:

1. **Preparation** — Documentation of process and relevant training for everyone who could be involved in an incident. This includes ensuring the appropriate monitoring and alerting is in place, and the right people are part of the on-call rotation.
1. **Identification** — Identifying a problem through instrumentation/alerting/monitoring, customer reports, team member reports, or security reports. Once identified, an incident is declared.
1. **Investigation** — Looking for the cause of an outage/service disruption and an initial determination of the impact, which informs the severity level.
1. **Containment** — Containing the impact and stabilizing the service as quickly as possible. Once containment is achieved, the incident is considered "mitigated".
1. **Remediation** — A more robust response to stabilizing the service. An incident is considered remediated, or "resolved", when all anomalous conditions are resolved.
1. **Recovery** — Improvements are made across testing and documentation. Corrective Actions that have been identified to prevent the incident from recurring or improve future response times may be started in this phase.
1. **Learnings** — Root Cause Analysis, Incident Reviews/retrospectives, and identifying further Corrective Actions. All of these feed into updating documentation and training in step 1, closing the feedback loop.
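The feedback loop above can be sketched as a simple phase progression. This is an illustrative sketch only: the phase names follow the list, but the enum and helper are hypothetical, not an actual GitLab tool.

```python
from enum import Enum, auto

class IncidentPhase(Enum):
    """Phases of the Incident Management feedback loop (hypothetical model)."""
    PREPARATION = auto()
    IDENTIFICATION = auto()
    INVESTIGATION = auto()
    CONTAINMENT = auto()
    REMEDIATION = auto()
    RECOVERY = auto()
    LEARNINGS = auto()

# Enum iteration preserves definition order.
LIFECYCLE = list(IncidentPhase)

def next_phase(phase: IncidentPhase) -> IncidentPhase:
    """Advance to the next phase; Learnings wraps back to Preparation,
    closing the feedback loop."""
    i = LIFECYCLE.index(phase)
    return LIFECYCLE[(i + 1) % len(LIFECYCLE)]
```

The wrap-around from Learnings to Preparation is the point: lessons from each incident feed the documentation and training that the next incident response starts from.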

For an overview of how we monitor and alert, see the [monitoring handbook page](/handbook/engineering/monitoring/). We also employ a [Development Escalation Process](/handbook/engineering/development/processes/infra-dev-escalation/process/) to get expertise from development teams as needed.
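As a rough illustration of the kind of check an SLO-based alerting tool performs (hypothetical function names and threshold; not the actual in-house tooling), a breach test might look like:

```python
def availability(successes: int, total: int) -> float:
    """Observed SLI: fraction of successful requests in the window."""
    return successes / total if total else 1.0

def breaches_slo(successes: int, total: int, slo_target: float = 0.999) -> bool:
    """True when observed availability falls below the SLO target,
    i.e. when an alert should fire."""
    return availability(successes, total) < slo_target
```

With a 99.9% target, a window of 998,500 successes out of 1,000,000 requests (99.85%) would fire an alert, while 999,999 out of 1,000,000 would not.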

### Scheduled Maintenance

Scheduled maintenance that is a `C1` should be treated as an undeclared incident.
@@ -101,7 +117,7 @@ On-Call rotations notified by automated systems:

| **Team** | **Primary Role** | **Function** | **Environment** | **Who?** |
| ---- | ---- | ----------- | ---- | ---- |
-| **Engineer On Call (EOC)** | [Incident Responder](./roles/incident-responder.html)| Primarily serves as the initial Incident Responder to automated alerting, and GitLab.com escalations - expectations for the role are in the [Handbook for oncall](/handbook/engineering/on-call/#expectations-for-on-call). The checklist for the EOC is in our [runbooks](https://gitlab.com/gitlab-com/runbooks/blob/master/on-call/checklists/eoc.md). There are runbooks designed to help EOC troubleshoot a broad range of issues - in the case where the runbooks are insufficient, the EOC will escalate by [engaging the Incident Manager and CMOC](#how-to-engage-response-teams). | GitLab.com | Generally an SRE and can declare an incident. Part of the "GitLab.com Production EOC" on call schedule in incident.io. |
+| **Engineer On Call (EOC)** | [Incident Responder](./roles/incident-responder.html)| Primarily serves as the initial Incident Responder to automated alerting, and GitLab.com escalations - expectations for the role are in the [Handbook for oncall](/handbook/engineering/infrastructure-platforms/incident-management/on-call/#general-expectations-for-on-call). The checklist for the EOC is in our [runbooks](https://gitlab.com/gitlab-com/runbooks/blob/master/on-call/checklists/eoc.md). There are runbooks designed to help EOC troubleshoot a broad range of issues - in the case where the runbooks are insufficient, the EOC will escalate by [engaging the Incident Manager and CMOC](#how-to-engage-response-teams). | GitLab.com | Generally an SRE and can declare an incident. Part of the "GitLab.com Production EOC" on call schedule in incident.io. |
| **Incident Manager On Call (IMOC)** |[Incident Lead](./roles/incident-lead.html) | Provides tactical coordination and leadership during complex incidents | GitLab.com | Rotation in [incident.io](https://app.incident.io/gitlab/on-call/schedules/01K77XZFD7X7E3W8T6GDVMKAFF) |

In low severity incidents, paged individuals may play multiple roles. For example, in an S4 incident the EOC may both perform the duties of the Incident Lead and Incident Responder. As severity increases, it becomes more important to have single individuals playing these roles; individuals in [Tier 2](#tier-2) will need to be paged.
+2 −2
@@ -74,7 +74,7 @@ If your eligibility status changes or you have been exempted from Incident Manag

### Starting your on-call shift

-Before your shift starts, verify your Slack alerts are working and your incident.io contact is up to date. Send a test page to make sure that you are receiving alerts correctly. You may get assigned to an [on-call handover issue](/handbook/engineering/on-call/#customer-emergency-on-call-rotation) if your shift start time
+Before your shift starts, verify your Slack alerts are working and your incident.io contact is up to date. Send a test page to make sure that you are receiving alerts correctly. You may get assigned to an [on-call handover issue](/handbook/engineering/infrastructure-platforms/incident-management/on-call/#customer-emergency-on-call-rotation) if your shift start time
lines up with the start of the 8-hour SRE on-call shifts.

When your on-call shift starts, you will get notification(s) that your shift is starting (email or text, depending on your incident.io preferences). You will also get a Slack notification about being added to the `@incident-managers` user group.
@@ -168,7 +168,7 @@ Example, Covering for someone. Go to the [schedule in incident.io](https://app.

### What if I am not available for my assigned shift?

-Shifts are assigned based on the working hours that you selected during onboarding. Our current process is to [swap shifts](/handbook/engineering/on-call/#swapping-on-call-duty) by asking for someone to take this shift in the `#im-general` Slack channel.
+Shifts are assigned based on the working hours that you selected during onboarding. Our current process is to [swap shifts](/handbook/engineering/infrastructure-platforms/incident-management/on-call/#swapping-on-call-duty) by asking for someone to take this shift in the `#im-general` Slack channel.

### What if I work a shift on a weekend or holiday?

+72 −0
---
title: On-Call Processes and Policies
aliases:
  - /handbook/engineering/on-call/
---

{{% alert color="warning" %}}
@@ -104,6 +106,18 @@ If you are on-call, you are expected to:

1. For team members in The Netherlands, if they cannot take an assigned shift, they must notify their rotation leader with at least 2 working days notice and the rotation leader (not the team member) is responsible for finding cover. (As agreed with the Works Council).

### General Expectations for On-Call

- If you are on call, you are expected to be available and ready to respond to PagerDuty pages or incident.io Escalations as soon as possible, within any response times set by our [Service Level Agreements](https://about.gitlab.com/support/#priority-support) in the case of Customer Emergencies. If you have plans outside of your workspace during your on-call shift, this may require bringing a laptop and a reliable internet connection with you.
- We take on-call seriously. There are escalation policies in place so that if a first responder does not respond in time, another team member is alerted. Such policies are not expected to be triggered under normal operations, and are intended to cover extreme and unforeseeable circumstances.
- Because GitLab is an asynchronous workflow company, @mentions of On-Call individuals in Slack will be treated like normal messages, and no SLA for response will be associated with them.
- Provide support to the release managers in the release process.
- As noted in the [main handbook](/handbook/people-group/time-off-and-absence/time-off-types/), after being on-call for Tier 1 or Tier 2 rotations, make sure that you take time off if you need to. If you have been involved in many pages and incidents during your shift, and you feel that you need to rest, please do. Resting after a stressful on-call shift is critical for preventing burnout. Be sure to inform your team of the time you plan to take for time off.
  - The expectation is that you take 1-2 days off as "time off in lieu" if you need to recover from your shift (where a shift is 5+ days).
  - The expectation is that you will communicate with your manager about the pressures of that shift so that improvements to alerts, processes, or other aspects about incident resolution can be addressed.
  - Team members in Australia should review the [Australia time in lieu policy](/handbook/total-rewards/benefits/general-and-entity-benefits/pty-benefits-australia/).
- During on-call duties, it is the team member's responsibility to act in compliance with local rules and regulations. If ever in doubt, please reach out to your manager and/or [aligned People Business Partner](/handbook/people-group/).

## Practical aspects of being on-call

1. You don’t need to install anything specific on your phone. The paging system can be set to notify you by email, phone call, SMS, or (if you choose to install the app) in-app notification.
@@ -169,3 +183,61 @@ However, if Slack channels are created, ensure that they follow this format:
- `tier-2-(team-name)-rotation-swaps-apac`
- `tier-2-(team-name)-rotation-swaps-emea`
- `tier-2-(team-name)-rotation-swaps-amer`

## incident.io

We use [incident.io](https://app.incident.io/) to set the on-call schedules, and to route notifications to the appropriate individual(s).

### Swapping On-Call Duty

Team members covering a shift for someone else are responsible for adding the override in incident.io. This can be arranged in the [#eoc-general](https://gitlab.enterprise.slack.com/archives/C07G9CP5XRR) Slack channel or via the Request Coverage feature of incident.io. They can delegate this task back to the requestor, but only after explicitly confirming they will cover the requested shift(s). To set an override, click the "Create Override" button in the upper right of the page, or click the relevant block of time on the schedule view. This action defaults the person in the override to *you* — incident.io assumes that you're the person volunteering an override. If you're processing this for another team member, you'll need to select their name from the drop-down list. Also see [this article](https://help.incident.io/articles/2815264840-cover-me%2C-overrides-and-schedules#overrides-38) for reference.

### Adding and removing people from the roster

When adding a new team member to the on-call roster, it's inevitable that the rotation schedule will shift. The manager adding a new team member will add the individual towards the end of the current rotation to avoid changing the current schedule, if possible. When adding a new team member to the rotation, the manager will raise the topic to their team(s) to make sure everyone has ample time to review the changes.

## Slack

In order to facilitate informal conversations around the on-call process and quality of life, as well as coordination of shifts and communication of broader announcements, we have the [#eoc-general](https://gitlab.enterprise.slack.com/archives/C07G9CP5XRR) channel.

## Other Engineering On-Call Rotations

The following on-call rotations exist outside of Infrastructure Platforms but are documented here for a single reference point.

### Customer Emergency On-Call Rotation

- We do 7 days of 8-hour shifts in a follow-the-sun style, based on your location.
- After 10 minutes, if the alert has not been acknowledged, the support manager on call will be alerted. After a further 5 minutes, senior support leadership from all 3 regions will be alerted.
- All tickets that are raised as emergencies will receive [the emergency SLA](https://about.gitlab.com/support/#priority-support). The on-call engineer's first action will be to [triage the emergency request](/handbook/support/workflows/customer_emergencies_workflows/#triage-the-emergency-request) and work with the customer to find the best path forward.
- After 30 minutes, if the customer has not responded to our initial contact with them, let them know that the emergency ticket will be closed and that you are opening a normal priority ticket on their behalf. Also let them know that they are welcome to open a new emergency ticket if necessary.
- You can view the [schedule](https://gitlab.pagerduty.com/schedules#PIQ317K) and the [escalation policy](https://gitlab.pagerduty.com/escalation_policies#PKV6GCH) on PagerDuty. You can also opt to [subscribe to your on-call schedule](https://support.pagerduty.com/main/docs/schedules-in-apps#export-only-your-on-call-shifts), which is updated daily.
- After each shift, *if* there was an alert / incident, the on-call person will send a handoff email to the next on-call explaining what happened and what is ongoing, linking to the relevant issues and their progress.
- If you need to reach the current on-call engineer and they're not accessible on Slack (e.g. it's a weekend, or the end of a shift), you can [manually trigger a PagerDuty incident](https://support.pagerduty.com/main/docs/incidents#trigger-an-incident) to get their attention, selecting **Customer Support** as the Impacted Service and assigning it to the relevant Support Engineer.
- See the [GitLab Support On-Call Guide](/handbook/support/on-call) for a more comprehensive guide to handling customer emergencies.
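The escalation timing in the list above can be sketched as follows. This is a hypothetical helper for illustration, not PagerDuty's escalation policy API:

```python
def escalation_targets(minutes_unacknowledged: int) -> list[str]:
    """Who has been alerted, given minutes since the page with no acknowledgement."""
    targets = ["on-call support engineer"]
    if minutes_unacknowledged >= 10:
        # After 10 minutes without an ack, the support manager on call is alerted.
        targets.append("support manager on call")
    if minutes_unacknowledged >= 15:
        # After a further 5 minutes, senior leadership in all 3 regions is alerted.
        targets += [f"senior support leadership ({r})" for r in ("APAC", "EMEA", "AMER")]
    return targets
```

For example, 12 minutes of silence pages the engineer and the support manager; at 15 minutes the list grows to include senior support leadership in all three regions.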

### Security Team On-Call Rotation

#### Security Operations (SecOps)

- SecOps on-call rotation is 7 days of 24-hour shifts.
- After 15 minutes, if the alert has not been acknowledged, the Security Manager on-call is alerted.
- You can view the [Security Operations schedule](https://gitlab.pagerduty.com/schedules#PYZC2CG) on PagerDuty.
- When on-call, prioritize work that will make the on-call better (that includes building projects, systems, adding metrics, removing noisy alerts). Much like the Production team, we strive to have nothing to do when being on-call, and to have meaningful alerts and pages. The only way of achieving this is by investing time in trying to automate ourselves out of a job.
- The main expectation when on-call is triaging the urgency of a page - if the security of GitLab is at risk, do your best to understand the issue and coordinate an adequate response. If you don't know what to do, engage the Security manager on-call to help you out.
- More information is available in the [Security Operations On-Call Guide](/handbook/security/security-operations/secops-oncall/) and the [Security Incident Response Guide](/handbook/security/security-operations/sirt/sec-incident-response/).

#### Security Managers

- Security Manager on-call rotation is 7 days of 12-hour shifts.
- Alerts are sent to the Security Manager on-call if the SecOps on-call page isn't answered within 15 minutes.
- You can view the [Security Manager schedule](https://gitlab.pagerduty.com/schedules#PJL6CVA) on PagerDuty.
- The Security Manager on-call is responsible for engaging alternative/backup SecOps Engineers in the event the primary is unavailable.
- In the event of a high-impact security incident to GitLab, the Security Manager on-call will be engaged to assist with cross-team/department coordination.

### Developer Experience Stage On-Call Rotation

- Developer Experience's on-call does not include work outside GitLab's normal business hours. The process is defined on our [pipeline on-call rotation](/handbook/engineering/testing/oncall-rotation/) page.
- The rotation is on a weekly basis across 3 timezones (APAC, EMEA, AMER) and triage activities happen during each team member's working hours.
- This on-call rotation is to ensure accurate and stable test pipeline results that directly affect our continuous release process.
- The list of pipelines which are monitored are defined on our [pipeline](/handbook/engineering/testing/end-to-end-pipeline-monitoring/) page.
- The schedule and roster is defined on our [schedule](https://gitlab.com/gitlab-org/quality/pipeline-triage#dri-weekly-rotation-schedule) page.