Escalate alerts according to the escalation policies for a project

Once escalation policy tables and models are available for projects with on-call schedules, we will want to escalate alerts according to the rules of the project's escalation policy. This issue covers the work needed to actually adhere to the policy defined for a project.

Scope/requirements of this issue:

  1. Add support for escalating alerts when the escalation policy dictates
  2. Escalations only need to target a provided schedule
  3. Escalations only need to happen after a given number of minutes (>= 0)
  4. Escalations happen only if the alert has not yet been acknowledged or resolved, whichever status the user chose for the rule
  5. Escalation rules should not apply to alerts which were created before the escalation policy
  6. Escalation rules should be adhered to as closely as possible for the sake of user trust. If we're late, that's money.
  7. Escalations should only occur once per escalation rule, per alert.
  8. Re-triggered alerts should start the escalation policy over, as if they had just been created. ("Re-triggered" meaning that the status was set to acknowledged/resolved, then back to triggered)
  9. If an escalation policy is modified, existing alerts should follow the original escalation rules. (If using the new rules is easier, do that instead & communicate the change in expectations to Product.)
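
Requirements 5 and 8 both come down to "from which moment do we measure a rule's delay?". A minimal sketch of that decision, assuming a hypothetical `retriggered_at` field on the alert and plain string statuses (neither is part of this proposal), and assuming that an alert created before the policy stays exempt even if it is later re-triggered:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    created_at: datetime
    status: str = "triggered"  # "triggered", "acknowledged" or "resolved"
    retriggered_at: Optional[datetime] = None  # hypothetical field


def escalation_start(alert: Alert, policy_created_at: datetime) -> Optional[datetime]:
    """Return the time rule delays are measured from, or None if exempt."""
    if alert.created_at < policy_created_at:
        return None  # requirement 5: alert predates the policy
    # Requirement 8: a re-triggered alert starts over as if newly created.
    return alert.retriggered_at or alert.created_at


def set_status(alert: Alert, new_status: str, now: datetime) -> None:
    """Record a status change, noting when an alert is re-triggered."""
    if new_status == "triggered" and alert.status in ("acknowledged", "resolved"):
        alert.retriggered_at = now
    alert.status = new_status
```

With this shape, the escalation job never needs to know whether an alert is new or re-triggered; it only sees a start time.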

Out of scope: backfilling escalation policies, auto-creating escalation policies, system notes for escalations, email updates

Proposal:

Table: incident_management_alert_escalations

Model: IncidentManagement::AlertEscalations

| Column | Required | Type | Description |
|---|---|---|---|
| id | true | Integer | ID of the escalation |
| policy_id | true | Integer | Escalation Policy to which the escalation corresponds |
| alert_id | true | Integer | ID of the alert |
| created_at | true | datetime_with_zone | Creation time of the escalation (AKA - time at which the escalation was "triggered") |
| updated_at | true | datetime_with_zone | Update time of the escalation (AKA - time at which notifications were last sent out) |

Flow:

  1. An alert comes in.
  2. An escalation policy is identified.
  3. Any zero-minute escalation rules are enacted.
  4. An Escalation is added to the incident_management_alert_escalations table.
  5. A cronjob runs every minute, starting a job for each Escalation.
  6. Job content:
    • Get Escalation.
    • Get job start_time.
    • Get alert. Get policy & rules.
    • Filter to applicable rules.
      • alert.status >= escalation_rule.status (the alert hasn't been resolved/ack-ed as the rule expects)
      • (job start_time - escalation.created_at) >= escalation_rule.time_elapsed (it's been too long)
      • (escalation.updated_at - escalation.created_at) < escalation_rule.time_elapsed (we haven't already notified for this rule)
    • For each applicable rule, send notifications.
    • Set escalation.updated_at to job start_time.
  7. On status change of alert or incident to Resolved, remove the Escalation. On status change of an alert from Resolved to anything else, create an Escalation.
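
The job content in step 6 can be sketched as a pure filter followed by a timestamp update. This is an illustration only: the `Rule` and `Escalation` classes and the function names are hypothetical. The direction of the status comparison depends on how statuses are encoded; the sketch assumes triggered < acknowledged < resolved, so "the alert hasn't reached the status the rule expects" is written as `alert_status < rule.status`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumed status encoding for this sketch only.
TRIGGERED, ACKNOWLEDGED, RESOLVED = 0, 1, 2


@dataclass
class Rule:
    status: int         # escalate unless the alert has reached this status
    elapsed: timedelta  # delay after which the rule applies


@dataclass
class Escalation:
    created_at: datetime
    updated_at: datetime


def applicable_rules(escalation: Escalation, alert_status: int,
                     rules: List[Rule], start_time: datetime) -> List[Rule]:
    """Return the rules that are due now and not yet notified."""
    waited = start_time - escalation.created_at
    covered = escalation.updated_at - escalation.created_at
    return [
        rule for rule in rules
        if alert_status < rule.status   # status not yet reached
        and waited >= rule.elapsed      # it's been too long
        and covered < rule.elapsed      # not notified by an earlier run
    ]


def run_job(escalation: Escalation, alert_status: int,
            rules: List[Rule], start_time: datetime) -> List[Rule]:
    """One cron tick: pick due rules, then record the run time."""
    due = applicable_rules(escalation, alert_status, rules, start_time)
    # Notifications for `due` would be sent here.
    escalation.updated_at = start_time
    return due
```

Note that a zero-minute rule never passes the `covered < rule.elapsed` check, which matches step 3: those rules are enacted when the alert arrives, not by the job.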

When the same alert keeps firing:

  • Notify when new alerts arrive and on escalations only.
  • New alerts trigger the escalation policy, sending one notification per rule.
  • Re-occurrences of existing alerts do nothing extra, but the alert will continue to be escalated according to the escalation policy. (EX - An alert was created 16 minutes ago. There are escalation rules for 0, 10, & 30 minutes. We've already sent out a notification at 0 minutes and another at 10 minutes. Now, the alert integration receives the same payload again, but we do nothing.)
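
The 0/10/30-minute example above can be replayed with a small simulation. Everything here is hypothetical scaffolding (the `AlertIntake` class, keying alerts by a `fingerprint` string, and counting time in whole minutes); it only illustrates that a repeated payload neither notifies nor resets the escalation clock.

```python
class AlertIntake:
    """Toy model: time is whole minutes, alerts are keyed by fingerprint."""

    def __init__(self, rule_minutes):
        self.rule_minutes = sorted(rule_minutes)
        self.created = {}   # fingerprint -> minute the alert was created
        self.last_run = {}  # fingerprint -> minute of the last job run
        self.sent = []      # (fingerprint, rule minute) per notification

    def receive(self, fingerprint, now):
        if fingerprint in self.created:
            return  # re-occurrence: do nothing, clock keeps running
        self.created[fingerprint] = now
        self.last_run[fingerprint] = now
        # Zero-minute rules are enacted as soon as the alert arrives.
        self.sent += [(fingerprint, m) for m in self.rule_minutes if m == 0]

    def tick(self, now):
        """The every-minute job: notify for rules that became due."""
        for fingerprint, created in self.created.items():
            covered = self.last_run[fingerprint] - created
            waited = now - created
            self.sent += [(fingerprint, m) for m in self.rule_minutes
                          if covered < m <= waited]
            self.last_run[fingerprint] = now


intake = AlertIntake([0, 10, 30])
intake.receive("disk-full", now=0)
for minute in range(1, 17):
    intake.tick(minute)
intake.receive("disk-full", now=16)  # same payload again: ignored
```

At this point `intake.sent` holds only the 0-minute and 10-minute notifications; the 30-minute rule still fires on schedule if the ticks continue.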

Validations/constraints:

  • escalation policy and alert should be present
  • Unique constraint: Combo of policy_id, alert_id should be unique
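
As a rough illustration of these constraints, the sketch below uses an in-memory SQLite database as a stand-in for the real one. The column types and the `create_escalation` helper are assumptions; only the table and column names come from the proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE incident_management_alert_escalations (
        id INTEGER PRIMARY KEY,
        policy_id INTEGER NOT NULL,
        alert_id INTEGER NOT NULL,
        created_at TEXT NOT NULL,
        updated_at TEXT NOT NULL,
        UNIQUE (policy_id, alert_id)
    )
""")


def create_escalation(conn, policy_id, alert_id, now):
    """Insert an escalation; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO incident_management_alert_escalations "
                "(policy_id, alert_id, created_at, updated_at) "
                "VALUES (?, ?, ?, ?)",
                (policy_id, alert_id, now, now),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised for both a missing policy/alert and a duplicate pair.
        return False
```

Enforcing uniqueness in the database, rather than only in the model, keeps two concurrent inserts from creating duplicate escalations for the same alert.
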
Edited by Sean Arnold