Discuss turning on auto-incident creation for alerts

Context

incident.io can automatically create incidents from alerts. Historically we have avoided doing this for a few reasons:

  1. Some alerts resolve quickly
  2. It creates process overhead to have to manage incidents
  3. During S1s there can often be many pages related to a single incident

The downsides of not auto-creating incidents are:

  1. We do not get data on how often alerts are being turned into incidents
  2. Incidents are not auto-populated with details from the alert, so incident responders have to fill out this information manually
  3. EOCs get paged for every alert that fires during an incident, instead of having an opportunity to decide whether the new alerts are part of the existing incident

incident.io offers the concept of triage incidents, which together with alert grouping may alleviate these concerns. Triage incidents help because they are not full-blown incidents. They do not require status page updates and they do not create GitLab incident issues; they are just lightweight Slack channels that give EOCs a space and time to investigate the alert. If the alert represents a real problem, the EOC can accept the triage incident, turning it into an active incident. If the alert is not real, or it resolves on its own, the triage incident is cancelled.
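The triage lifecycle described above can be sketched as a small state machine. This is a minimal illustration only: `TriageIncident`, `State`, and the method names are hypothetical and are not part of the incident.io API. It also assumes that once a triage incident has been accepted, a resolving alert no longer cancels it.

```python
from enum import Enum


class State(Enum):
    TRIAGE = "triage"
    ACTIVE = "active"
    CANCELLED = "cancelled"


class TriageIncident:
    """Hypothetical model of a triage incident; not the incident.io API."""

    def __init__(self, alert_name: str) -> None:
        self.alert_name = alert_name
        self.state = State.TRIAGE

    def accept(self) -> None:
        # The EOC decides the alert represents a real problem.
        if self.state is State.TRIAGE:
            self.state = State.ACTIVE

    def alert_resolved(self) -> None:
        # The alert cleared on its own or was not real: cancel, but only
        # if the incident was never accepted (assumption of this sketch).
        if self.state is State.TRIAGE:
            self.state = State.CANCELLED
```

In this sketch an accepted incident stays active until someone closes it through the normal incident process, which is outside the scope of the model.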

Proposal

  1. Stop paging directly from PagerDuty.
  2. Configure Alertmanager to open a triage incident for any alert.
  3. When a triage incident is opened by an alert, incident.io should page the EOC that the alert would otherwise have paged in PagerDuty.
  4. If any subsequent alerts fire, rely on grouping to notify the EOC in the incident channel. If the EOC indicates the alert is part of the current incident, they will not get paged and no new incident will be opened.
  5. If the EOC does not indicate that the alert is related within the defined period of time, incident.io will create a new triage incident for it.
  6. For any triage incident created, the EOC can either let the alert resolve, which auto-cancels the triage incident, or accept the triage incident to turn it into a live, active incident.
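
To make steps 3 to 5 concrete, the routing behaviour can be modelled roughly as below. This is a sketch under stated assumptions, not incident.io's implementation: `AlertRouter`, `grouping_window_s`, and the 600-second default are hypothetical, and the model treats the most recently opened incident as the "current" one.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    id: int
    alerts: list[str] = field(default_factory=list)


class AlertRouter:
    """Hypothetical model of the proposed flow; not the incident.io API."""

    def __init__(self, grouping_window_s: float = 600.0) -> None:
        self.grouping_window_s = grouping_window_s  # assumed value
        self.incidents: list[Incident] = []
        self.pages: list[tuple[int, str]] = []  # (incident id, alert) pages sent to the EOC
        self.pending: dict[str, float] = {}     # alert -> time it was posted to the channel

    def _open_triage(self, alert: str) -> Incident:
        incident = Incident(id=len(self.incidents) + 1, alerts=[alert])
        self.incidents.append(incident)
        self.pages.append((incident.id, alert))  # step 3: page the EOC
        return incident

    def on_alert(self, alert: str, now: float) -> str:
        if not self.incidents:
            self._open_triage(alert)
            return "paged"
        # Step 4: notify in the incident channel instead of paging.
        self.pending.setdefault(alert, now)
        return "notified"

    def mark_related(self, alert: str) -> None:
        # Step 4: the EOC confirms the alert belongs to the current incident.
        if self.pending.pop(alert, None) is not None:
            self.incidents[-1].alerts.append(alert)

    def tick(self, now: float) -> list[Incident]:
        # Step 5: alerts left unacknowledged past the window each get
        # their own triage incident, which pages the EOC.
        expired = [a for a, t in self.pending.items()
                   if now - t >= self.grouping_window_s]
        opened = []
        for alert in expired:
            del self.pending[alert]
            opened.append(self._open_triage(alert))
        return opened
```

With this model, the first alert pages the EOC, later alerts only produce channel notifications, and calling `tick` after the window has elapsed opens a new triage incident for every alert the EOC did not mark as related.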
Edited by Kam Kyrala