How to page & acknowledge manually created incidents

Problem to be solved

Today, incident management is set up to trigger escalation policies for new alerts. In this scenario, the on-call responder who is paged can end the paging by acknowledging the alert, that is, by changing its status from triggered to acknowledged. If the responder changes the status back to triggered, we restart the escalation policy and begin paging again.

When a user creates an incident manually, there is no associated alert.

We need to enable paging on incidents and the ability for a responder to "acknowledge" and end paging for a manually created incident AND to "un-acknowledge" or restart paging on an incident for a different escalation policy or user.

Things to figure out

  • Today, escalation policies are only triggered by alerts. We need to adapt this so that escalation policies can also be triggered by incidents.
  • Incidents only have two statuses: OPEN or CLOSED. We will need to figure out how to allow a user to acknowledge and un-acknowledge an incident.

Intended Users

User Experience

A user creates an incident and selects the escalation policy or user to page for that incident. The on-call responder can "acknowledge" the incident to indicate that they are working on it and to end paging.

Design

Introduce sidebar items that surface Status and Escalation Policies:

  • Status and escalation policy sidebar items (mockup: Status_sidebar_item)
  • Status dropdown options (mockup: Status_dropdown)
  • More information about what changing the status does (mockup: Tooltip)
  • Escalation policy dropdown (mockup: Paging_dropdown)
  • No escalation policies created (mockup: Paging_dropdown_-_no_policies)
  • No escalation policies created - expanded (mockup: Paging_dropdown_-_no_policies_expanded)
  • Hidden edit button for those without proper permissions (mockup: Non-developers)
  • System note for Paging status changes (mockup: System_note_3)
  • Surface status on incident list (mockup: Surface_status_on_incident_list)

Notes:

  • Only developers and up will be able to edit the incident status or escalation policy. Reporters and non-project members will have the Edit button hidden.
  • Changing the incident status to acknowledged or resolved will stop paging according to the specified escalation policy. On the other hand, changing the status from acknowledged or resolved to triggered will re-start paging.
  • If an incident was created from an alert, the alert and incident statuses will be mirrored (so, an ACKed alert will become an ACKed incident).
  • For incidents created from alerts - if an escalation policy has been created for the project, the escalation policy will be pre-populated when the incident is created. For manual incidents, the escalation policy needs to be defined manually.
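
The stop/restart rule in the notes above can be sketched as a small transition check. This is only an illustration in plain Ruby: the constant names and the `paging_action` helper are hypothetical, and the status list covers only the three statuses named in this issue.

```ruby
# Hypothetical grouping of the statuses named in the notes above.
PAGING_STATUSES = %i[triggered].freeze
QUIET_STATUSES  = %i[acknowledged resolved].freeze

# Returns :stop_paging, :restart_paging, or :no_change for a status transition.
def paging_action(from, to)
  if PAGING_STATUSES.include?(from) && QUIET_STATUSES.include?(to)
    :stop_paging
  elsif QUIET_STATUSES.include?(from) && PAGING_STATUSES.include?(to)
    :restart_paging
  else
    # For example acknowledged -> resolved: paging was already stopped.
    :no_change
  end
end
```

Under this sketch, moving between acknowledged and resolved does not touch paging; only transitions into or out of triggered do.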

Email to users when paged:

An incident has been triggered in [group/project].

View incident details

Title: [Insert title here]

Description: [Insert description here]

Escalation policy: [Insert escalation policy, if present]

Metric: [Insert metric, if available]*

[Metric could be a string or a link, up to the discretion of the engineer implementing this issue. Longer-term, we'll likely include a png of the metric but that's out of scope for the first iteration.]

Figma file

Technical Implementation Plan

  • Assumption! When escalation policies are changed, existing alerts and incidents will be escalated according to the previous policy. If the status is updated to triggered, the new policy will be applied instead.
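
One way to read this assumption is that an escalation in progress works from a snapshot of the policy's rules, and only a return to triggered takes a fresh snapshot. The sketch below is plain Ruby with hypothetical names (`EscalationRunSketch`, a `:rules` key, illustrative rule strings); it is not the planned implementation.

```ruby
# Hypothetical in-memory escalation run; rule strings are illustrative only.
class EscalationRunSketch
  attr_reader :rules

  def initialize(policy)
    @policy = policy
    @rules = snapshot
  end

  # Called when the status is set back to triggered: pick up the current rules.
  def retrigger
    @rules = snapshot
  end

  private

  # Copy the rules so later edits to the policy do not affect this run.
  def snapshot
    @policy[:rules].dup.freeze
  end
end
```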

The plan below has 3 steps. Part 1 blocks part 2, which blocks part 3. 3A-C can be completed in parallel.

Part 1: Add table/model for IssuableEscalationStatus. backend

Scope:

  • Add new table.
  • Add new model.
  • Add IssuableEscalationStatus has_one associations to issues.
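
| Column               | Required | Type    | Description                                              |
|----------------------|----------|---------|----------------------------------------------------------|
| id                   | true     | Integer | ID of the object                                         |
| issue_id             | true     | Integer | Incident which has an escalation status                  |
| escalation_policy_id | false    | Integer | Escalation policy used to page for the incident, if any  |
| status               | true     | Integer | One of AlertManagement::Alert::STATUSES                  |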

Validations/constraints:

  • issue, status should both be present
  • escalation_policy should be in the same project as issue, if present
  • status should be in AlertManagement::Alert::STATUSES
  • Unique constraint: issue_id should be unique
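
The validations above could look roughly like the following plain-Ruby sketch. It deliberately avoids ActiveRecord so it runs standalone; `SKETCH_STATUSES` is an assumed stand-in for AlertManagement::Alert::STATUSES (its integer values are illustrative), and `SketchIssue`, `SketchPolicy`, and `EscalationStatusSketch` are hypothetical names. In the real model, uniqueness would be enforced by a database constraint rather than by a list of IDs.

```ruby
# Assumed stand-in for AlertManagement::Alert::STATUSES (values are illustrative).
SKETCH_STATUSES = { triggered: 0, acknowledged: 1, resolved: 2 }.freeze

SketchIssue  = Struct.new(:id, :project_id, keyword_init: true)
SketchPolicy = Struct.new(:id, :project_id, keyword_init: true)

# Plain-Ruby sketch of the record; not the real ActiveRecord model.
class EscalationStatusSketch
  attr_reader :issue, :escalation_policy, :status

  def initialize(issue:, status:, escalation_policy: nil)
    @issue = issue
    @status = status
    @escalation_policy = escalation_policy
  end

  # Returns a list of validation errors (empty when the record is valid).
  # existing_issue_ids simulates the unique constraint on issue_id.
  def errors(existing_issue_ids = [])
    errs = []
    errs << "issue must be present" if issue.nil?
    errs << "status must be present" if status.nil?
    errs << "status is not a known status" if status && !SKETCH_STATUSES.value?(status)
    if issue && escalation_policy && escalation_policy.project_id != issue.project_id
      errs << "escalation policy must belong to the issue's project"
    end
    if issue && existing_issue_ids.include?(issue.id)
      errs << "issue already has an escalation status"
    end
    errs
  end
end
```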

Part 2: Add escalation support for incidents. backend

Scope:

  • Auto-create an IssuableEscalationStatus for new incidents without an associated alert.
  • Escalate incidents w/o alerts based on IssuableEscalationStatus according to escalation policy, per approach in https://gitlab.com/gitlab-org/monitor/monitor/-/issues/56#note_538327473.
  • Add new email to be sent when paging on incident.
  • When the issue-type is changed, delete an existing IssuableEscalationStatus.
  • When the incident is moved to another project, set the escalation_policy_id to null, reset the status to Triggered.
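
The lifecycle rules in this scope can be sketched with plain structs. All names here (`EscalationState`, `IncidentSketch`, and the three helpers) are hypothetical, and the sketch assumes a newly created status starts as triggered, which this issue implies but does not state outright.

```ruby
EscalationState = Struct.new(:escalation_policy_id, :status, keyword_init: true)
IncidentSketch  = Struct.new(:issue_type, :project_id, :alert_id, :escalation_state,
                             keyword_init: true)

# New incidents without an associated alert get a status record.
# Assumption: the initial status is :triggered.
def ensure_escalation_state(incident)
  return incident unless incident.issue_type == :incident && incident.alert_id.nil?

  incident.escalation_state ||= EscalationState.new(escalation_policy_id: nil, status: :triggered)
  incident
end

# Changing the issue type deletes any existing escalation status.
def change_issue_type(incident, new_type)
  return incident if incident.issue_type == new_type

  incident.issue_type = new_type
  incident.escalation_state = nil
  incident
end

# Moving to another project clears the policy and resets the status.
def move_to_project(incident, new_project_id)
  incident.project_id = new_project_id
  if incident.escalation_state
    incident.escalation_state.escalation_policy_id = nil
    incident.escalation_state.status = :triggered
  end
  incident
end
```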

Blocked by: #323139 (closed)

Part 3A: Add slash command for escalating an incident. backend

Scope:

  • Add new slash command /page <escalation-policy>

Slash command should require an escalation policy from the project as an argument. It should only be available on incidents. It should only be available to developer+. It should not be available on incidents with an associated alert.

If paging has already begun for an escalation policy, reset the status to Triggered, change the escalation_policy_id and re-start escalating on the new policy.
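
A rough sketch of the availability check, argument parsing, and re-paging behaviour described above follows. It is plain Ruby, not GitLab's quick-action framework: the helper names, the hash-based policy and state shapes, the lookup of a policy by name (the issue does not say whether the argument is a name or an ID), and the numeric access levels are all assumptions.

```ruby
# Matches "/page <escalation-policy>" and captures the argument.
PAGE_COMMAND_PATTERN = %r{\A/page\s+(?<policy>\S.*?)\s*\z}

# Illustrative role ordering; the numbers are assumptions for this sketch.
ACCESS_LEVELS = { guest: 10, reporter: 20, developer: 30, maintainer: 40, owner: 50 }.freeze

# Only on incidents, only without an associated alert, only for developer+.
def page_command_available?(issue_type:, has_alert:, role:)
  issue_type == :incident && !has_alert &&
    ACCESS_LEVELS.fetch(role, 0) >= ACCESS_LEVELS[:developer]
end

# Returns the matching project policy, or nil when the command is malformed
# or names a policy that is not in the project.
def parse_page_command(text, project_policies)
  match = PAGE_COMMAND_PATTERN.match(text.strip)
  return nil unless match

  project_policies.find { |policy| policy[:name].casecmp?(match[:policy]) }
end

# Switches to the new policy and resets the status so escalation starts over.
def apply_page_command(state, policy)
  state.merge(escalation_policy_id: policy[:id], status: :triggered)
end
```

Returning a new state hash from `apply_page_command` keeps the sketch free of side effects; the real command would also need to restart the escalation itself.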

Part 3B: Add Status dropdown in the UI. backend frontend

Scope:

  • Dropdown should include same options as the Alert status dropdown. frontend
  • Expose IssuableEscalationStatus in GraphQL API. backend

The UI should show the alert's status if the incident is associated with an alert. In this scenario, changing the status should update the status of the alert.

Setting the status of an incident to Triggered should reset the Escalations for the incident.
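
The status behaviour described for the dropdown can be sketched as a small delegating object. `AlertStatusSketch` and `IncidentStatusSketch` are hypothetical names, and the reset counter merely stands in for whatever "resetting the escalations" ends up meaning in the implementation.

```ruby
AlertStatusSketch = Struct.new(:status)

class IncidentStatusSketch
  attr_reader :alert, :escalation_resets

  def initialize(alert: nil, status: :triggered)
    @alert = alert
    @own_status = status
    @escalation_resets = 0
  end

  # Show the alert's status when an alert is associated, else the incident's own.
  def status
    alert ? alert.status : @own_status
  end

  # Write through to the alert when present; count a reset on :triggered.
  def update_status(new_status)
    if alert
      alert.status = new_status
    else
      @own_status = new_status
    end
    @escalation_resets += 1 if new_status == :triggered
    status
  end
end
```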

Part 3C: Add Escalation policy dropdown in the UI. backend frontend

Scope:

  • Dropdown should include all the escalation policies available for the project. frontend
  • Expose EscalationPolicies in GraphQL API. backend

If applicable, UI should show the currently firing escalation policy. Setting the escalation policy should start notifications on that policy for the incident.

Edited by Amelia Bauerly