Infrastructure’s highest priority is to ensure the availability and performance of GitLab.com. In order to achieve this goal, the chaos that is introduced whenever changes take place in the production environment must be managed. These changes include both those that are purposely made to the environment through maintenance work and deployments (which are handled through Change Management), and those that happen as a result of failures in infrastructure components (which are handled through Incident Management).
Managing chaos is both feasible and attainable. As with any other endeavor we undertake, it entails structure and discipline in our approach to managing the environment, especially as a team. Our initial approach defines:
- Incident Severities
- Dedicated Communication Channels
Incident severities encapsulate the impact of an incident and scope the resources allocated to handle it. Detailed definitions must be provided for each severity, and these definitions must be reevaluated as new circumstances become known.
| Severity | Definition and Examples |
|----------|-------------------------|
| S1 | Incident has a significant impact on all of GitLab.com. Examples of S1 incidents include a complete blackout of the site, significant degradation of performance, a high rate of errors visible to end users, and users being unable to sign up or sign in. |
| S2 | Incident has a significant impact on important portions of GitLab.com. Examples of S2 incidents include particular features not working across all or most repositories (issues, discussions, etc.) while the repositories themselves remain accessible. An S2 incident has the potential of becoming an S1 incident if not mitigated expediently. |
| S3 | Incident has a limited impact on important portions of GitLab.com. Examples of S3 incidents include particular features not working across a small number of repositories. |
| S4 | Incident has a limited impact on... Examples of S4 incidents include ... |
- When a closed S2 or S3 incident is followed by another S2 or S3 incident within 3 hours, the latter incident is automatically upgraded to S2.
- Alert severities do not necessarily determine incident severities. A single incident can trigger a number of alerts at various severities, but the determination of the incident's severity is driven by the above definitions.
- Over time, we aim to automate the determination of an incident's severity through service-level monitoring that can aggregate individual alerts against specific SLAs.
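The escalation rule above is mechanical enough to automate. The sketch below, a minimal illustration in Python, encodes the 3-hour upgrade window from the policy; the function name, data shapes, and everything beyond the 3-hour rule itself are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# The 3-hour window comes from the policy above; names and shapes
# here are hypothetical, not an actual GitLab implementation.
ESCALATION_WINDOW = timedelta(hours=3)

def escalate_severity(new_severity, new_start, recent_closed):
    """Upgrade a new S2/S3 incident to S2 when it follows another
    recently closed S2/S3 incident within the escalation window.

    new_severity:  "S1".."S4" of the incoming incident
    new_start:     datetime when the new incident opened
    recent_closed: list of (severity, closed_at) tuples for closed incidents
    """
    if new_severity not in ("S2", "S3"):
        return new_severity
    for severity, closed_at in recent_closed:
        if severity in ("S2", "S3") and new_start - closed_at <= ESCALATION_WINDOW:
            return "S2"
    return new_severity
```

A service-level monitor could call a function like this each time an incident is opened, alongside the alert-aggregation logic mentioned above, so that severity determination requires no manual step.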
Whenever an incident takes place, our nature draws us to help resolve the issue as quickly as possible. As the complexity and size of the environment increase, it becomes necessary to orchestrate activities during an incident. In order to properly orchestrate work during an incident, roles must be defined so that every available resource is used effectively and efficiently.
| Role | Definition and Examples |
|------|-------------------------|
| Incident Manager (IMOC) | The tactical leader of the incident response team; this must not be the person doing the technical work to resolve the incident. The IMOC assembles the Incident Team, evaluates data (technical and otherwise) coming from team members, evaluates the technical direction of incident resolution, coordinates troubleshooting efforts, and is responsible for documentation and debriefs after the incident. |
| Communications Manager (CMOC) | The communications leader of the incident response team. The focus of the Incident Team is on resolving the incident as quickly as possible, but there is a critical need to disseminate information to interested parties, including other employees, eStaff, and end users. For S1 (and possibly S2) incidents, this is a dedicated role; otherwise, the IMOC can handle communications. |
| On-Call + Incident Team | The Incident Team is primarily composed of the on-call person. However, the Incident Manager can call in additional resources as necessary. |
These definitions imply several on-call rotations for the different roles. The IMOC should be a technical person with a good understanding of GitLab.com's architecture. The CMOC is not required to be technical. The IMOC and the CMOC work in tandem to manage the incident and communicate appropriately to all necessary audiences (end-users, customers, eStaff and employees).
Information is a key asset during an incident. Properly managing the flow of information to its intended destination is critical to both the resolution of the incident and to keep stakeholders apprised of developments in a timely fashion.
This flow is determined by
- the type of information,
- its intended audience,
- and timing sensitivity.
Furthermore, avoiding information overload is necessary to keep every stakeholder focused.
To that end, we should:
- Have a dedicated incident bridge (Zoom call) so that the Incident Team has a well-known place to convene.
- Have a dedicated #incident channel. #production contains large amounts of information, and it takes effort to filter out non-relevant items depending on the audience. This is particularly important for the Incident Team, which must be focused on technical information to resolve the incident. While #incident is an open channel, we will encourage people to use other channels to communicate with the IMOC, so as to keep the Incident Team focused on resolving the incident.
- Have periodic updates (on the order of every 15 minutes) intended for the various audiences (the CMOC handles this):
- End-users (Twitter)
- Support staff
- Employees at large
- Have a dedicated repo for issues related to Production, separate from the queue that holds Production Engineering's work: namely, issues for incidents and changes. This is useful because there may be other teams that need to do work in the production environment, and their changes and incidents should be tracked accordingly.
Incident Management Runbooks
With severities, roles, and communication channels defined, we can start to develop runbooks to help us manage incidents and the expectations around them, which is the next step in developing solid Incident Management. The severity of an incident determines, for instance, how often we communicate and with which stakeholders.
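One way a runbook could encode the severity-to-communication mapping is as a small lookup table. The sketch below is hypothetical: the 15-minute cadence for S1 mirrors the guideline above, but the other cadences and audience lists are placeholder values a team would still need to agree on.

```python
# Hypothetical communication plan keyed by incident severity.
# Only the 15-minute S1 cadence is taken from the guideline above;
# all other values are illustrative placeholders.
COMMUNICATION_PLAN = {
    "S1": {"update_every_minutes": 15,
           "audiences": ["end-users (Twitter)", "support staff", "employees"]},
    "S2": {"update_every_minutes": 30,
           "audiences": ["support staff", "employees"]},
    "S3": {"update_every_minutes": 60,
           "audiences": ["support staff"]},
    "S4": {"update_every_minutes": None,  # ad hoc, no fixed cadence
           "audiences": []},
}

def update_cadence(severity):
    """Return (minutes between updates, audiences) for a severity."""
    plan = COMMUNICATION_PLAN[severity]
    return plan["update_every_minutes"], plan["audiences"]
```

Keeping this mapping in one place lets the CMOC (or automation around status updates) consult a single source of truth rather than re-deciding the cadence during each incident.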
This issue expands the concepts outlined on