Incident Metrics: A Visual Guide
Why is this change being made?
Incident Management Metrics: A Visual Guide - 1... (#13986 - closed) is part two of a three-part series for Blog Post Series for GitLab Incident Management... (&1945).
Metrics to include:
- First product impact: The first moment of severe impact to the product. Timeline point: start time.
- Mean time to detection (MTTD): When the operator becomes aware of the problem. Timeline point: impact detected.
- Service Level Agreement (SLA): The time frame in which you can expect the first response. SLA times are not an expected time to resolution. Timeline point: response initiated.
- Severity
  - severity1: Service is unavailable or completely unusable (30 minutes)
  - severity2: Service is highly degraded, there is no workaround, and there is a significant business impact (4 hours)
  - severity3: Something is preventing normal service operation, but there is a workaround (8 hours)
  - severity4: There are questions or clarifications around features or documentation that have minimal or no business impact (24 hours)
- Mean time to mitigate (MTTM): When there is no longer severe product impact. The system may still be degraded in some way. Timeline point: impact mitigated.
- Mean time to recovery (MTTR): When the system has fully recovered and is operating normally. Note: Sometimes recovery and mitigation are the same, but sometimes they are different. MTTR is the same as the DORA metric Time to restore service: the time an incident was open in a production environment over the given time period. Timeline point: end time.
- Mean time between incidents (MTBI): The time between the full recovery of the system from one incident and the first product degradation of the next incident.
- Service Level Objectives (SLO): A target for the proper level of reliability.
- Service Level Indicators (SLI): A metric that tells you how your service is operating from the perspective of your users, for example whether a user can load a page quickly enough.
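To make the relationship between the timeline points and the metrics concrete, here is a minimal Python sketch. It is an illustration only, not how GitLab computes these values. The `Incident` record, its field names, and the helper functions are hypothetical. The sketch assumes MTTD, MTTM, and MTTR are all measured from the start time, and that the SLA first-response clock starts when impact is detected; the list above does not specify either choice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

# First-response targets per severity, taken from the severity list above.
SLA_FIRST_RESPONSE = {
    "severity1": timedelta(minutes=30),
    "severity2": timedelta(hours=4),
    "severity3": timedelta(hours=8),
    "severity4": timedelta(hours=24),
}


@dataclass
class Incident:
    # Hypothetical record; field names mirror the timeline points.
    severity: str
    start_time: datetime          # first product impact
    impact_detected: datetime     # operator becomes aware
    response_initiated: datetime  # first response
    impact_mitigated: datetime    # no longer severe impact
    end_time: datetime            # full recovery


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def summarize(incidents):
    """Return MTTD, MTTM, MTTR, and MTBI in minutes.

    Assumption: the first three are measured from start_time.
    """
    ordered = sorted(incidents, key=lambda i: i.start_time)
    summary = {
        "mttd": mean_minutes([i.impact_detected - i.start_time for i in ordered]),
        "mttm": mean_minutes([i.impact_mitigated - i.start_time for i in ordered]),
        "mttr": mean_minutes([i.end_time - i.start_time for i in ordered]),
    }
    # MTBI: gap between one incident's recovery and the next incident's start.
    gaps = [b.start_time - a.end_time for a, b in zip(ordered, ordered[1:])]
    summary["mtbi"] = mean_minutes(gaps) if gaps else None
    return summary


def met_sla(incident):
    """Check the first-response target.

    Assumption: the response clock starts at impact_detected.
    """
    elapsed = incident.response_initiated - incident.impact_detected
    return elapsed <= SLA_FIRST_RESPONSE[incident.severity]


# Made-up sample data for illustration.
incidents = [
    Incident("severity1",
             datetime(2021, 1, 1, 10, 0), datetime(2021, 1, 1, 10, 5),
             datetime(2021, 1, 1, 10, 40), datetime(2021, 1, 1, 10, 45),
             datetime(2021, 1, 1, 11, 30)),
    Incident("severity2",
             datetime(2021, 1, 2, 11, 30), datetime(2021, 1, 2, 11, 45),
             datetime(2021, 1, 2, 12, 0), datetime(2021, 1, 2, 13, 0),
             datetime(2021, 1, 2, 14, 30)),
]

summary = summarize(incidents)
sla_results = [met_sla(i) for i in incidents]
```

With this sample data, the first incident misses its severity1 target (35 minutes from detection to first response against a 30-minute target), while the second meets its severity2 target. Mitigation and recovery are tracked separately, so MTTM and MTTR differ whenever the system stays degraded after severe impact ends.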
Sources
Edited by Alana Bellucci