Stop syncing alert and incident statuses

Sarah Yasonik requested to merge sy-stop-alert-incident-sync into master

What does this MR do and why?

Related issues: #356057 (closed), https://gitlab.com/gitlab-org/gitlab/-/issues/348676

Changes:

  • Allows the status of a related incident & alert to be updated independently (removes the sync behavior)
  • Clears the escalation policy attribute from any incident that was created from an alert

Context & motivation:

  • Updating behavior to pave the way for new features
    • We're adding two new capabilities for alerts & incidents:
      • ability to link an alert to an incident after the incident has been created (currently only linkable via creating the incident from the alert)
      • ability to link one incident to multiple alerts (currently only allowed 1:1 incident:alert)
    • With the new functionality, it no longer makes sense for the incident status & alert status to match automatically, since different alerts for the same incident might be resolved at different times. An incident may also have been escalated before an alert was associated with it, so we wouldn't want to alter the escalation behavior for that incident.
  • Improving endpoint performance
    • The more actions we take for an incoming alert (like the status sync), the longer the request takes. We've been encountering timeout errors and scale issues.
    • Removing the sync behavior reduces the requirements of the alerting endpoints & helps us to improve the request performance.
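
The reasoning above (one incident linked to several alerts that resolve at different times) can be illustrated with a small in-memory model. This is only a sketch: `Alert`, `Incident`, and `update_alert_status` are hypothetical stand-ins for illustration, not GitLab's actual models or services.

```ruby
# Hypothetical in-memory stand-ins; not GitLab's ActiveRecord models.
Alert = Struct.new(:title, :status)
Incident = Struct.new(:title, :status, :alerts)

# Updates only the alert. Linked incidents are deliberately left untouched,
# which is the behavior this MR moves toward.
def update_alert_status(alert, new_status)
  alert.status = new_status
  alert
end

incident = Incident.new(
  "Service degraded",
  :triggered,
  [Alert.new("Disk full", :triggered), Alert.new("CPU saturated", :triggered)]
)

# One of the two alerts resolves; the other is still firing.
update_alert_status(incident.alerts.first, :resolved)

puts incident.status             # prints "triggered"
p incident.alerts.map(&:status)  # prints [:resolved, :triggered]
```

With a synced status there would be no sensible single value for the incident here, since its alerts disagree.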

Scope note: Future MRs will allow an escalation policy to be applied for any incident, and to link incidents to multiple alerts. This MR is constrained to allowing an independent incident status.

Database info:

  • terminal output

    DOWN:

    % bin/rails db:migrate:down:main VERSION=20220629184402
    main: == 20220629184402 UnsetEscalationPoliciesForAlertIncidents: reverting =========
    main: == 20220629184402 UnsetEscalationPoliciesForAlertIncidents: reverted (0.0024s) 

    UP:

    % bin/rails db:migrate                                 
    main: == 20220629184402 UnsetEscalationPoliciesForAlertIncidents: migrating =========
    main: == 20220629184402 UnsetEscalationPoliciesForAlertIncidents: migrated (0.0421s) 
  • sql queries
    -- Batching: find the first id, then the id that starts the next batch
    SELECT "incident_management_issuable_escalation_statuses"."id"
    FROM "incident_management_issuable_escalation_statuses"
    ORDER BY "incident_management_issuable_escalation_statuses"."id" ASC
    LIMIT 1;

    SELECT "incident_management_issuable_escalation_statuses"."id"
    FROM "incident_management_issuable_escalation_statuses"
    WHERE "incident_management_issuable_escalation_statuses"."id" >= 1
    ORDER BY "incident_management_issuable_escalation_statuses"."id" ASC
    LIMIT 1
    OFFSET 1000;

    -- Nullify values for records whose incident has an associated alert
    UPDATE "incident_management_issuable_escalation_statuses"
    SET "policy_id" = NULL, "escalations_started_at" = NULL
    WHERE "incident_management_issuable_escalation_statuses"."id" IN (
      SELECT "incident_management_issuable_escalation_statuses"."id"
      FROM "incident_management_issuable_escalation_statuses"
      INNER JOIN alert_management_alerts ON alert_management_alerts.issue_id = incident_management_issuable_escalation_statuses.issue_id
      WHERE "incident_management_issuable_escalation_statuses"."id" >= 1
      AND "incident_management_issuable_escalation_statuses"."policy_id" IS NOT NULL
    );
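
The effect of the queries above can be approximated in plain Ruby. This is a sketch under stated assumptions, not the migration itself: `EscalationStatus`, `each_batch`, and `unset_policies_for_alert_incidents` are hypothetical names, the rows live in memory rather than in the database, and array slicing stands in for the id-range batching shown in the SQL.

```ruby
# Hypothetical in-memory row; mirrors the columns touched by the UPDATE.
EscalationStatus = Struct.new(:id, :issue_id, :policy_id, :escalations_started_at)

BATCH_SIZE = 1000 # same size as the OFFSET in the logged batching query

# Yields consecutive slices of rows ordered by id, batch_size at a time.
# (Assumption: slicing replaces the real id-range lookups.)
def each_batch(rows, batch_size: BATCH_SIZE)
  rows.sort_by(&:id).each_slice(batch_size) { |batch| yield batch }
end

# Clears policy fields for statuses that have a policy AND whose issue_id
# also appears on an alert (the INNER JOIN in the UPDATE's subquery).
def unset_policies_for_alert_incidents(statuses, alert_issue_ids, batch_size: BATCH_SIZE)
  each_batch(statuses, batch_size: batch_size) do |batch|
    batch.each do |status|
      next if status.policy_id.nil?
      next unless alert_issue_ids.include?(status.issue_id)

      status.policy_id = nil
      status.escalations_started_at = nil
    end
  end
  statuses
end

statuses = [
  EscalationStatus.new(1, 10, 7, Time.now), # incident created from an alert
  EscalationStatus.new(2, 11, 7, Time.now), # incident without an alert
  EscalationStatus.new(3, 12, nil, nil)     # no policy set
]
alert_issue_ids = [10] # issue_id values found on alerts

unset_policies_for_alert_incidents(statuses, alert_issue_ids)
statuses.each { |s| puts "#{s.id}: policy=#{s.policy_id.inspect}" }
```

Only the first row is cleared: the second has no linked alert, and the third had no policy to begin with.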

Screenshots or screen recordings

| Scenario | Expected behavior | Original behavior, if different |
| --- | --- | --- |
| Changing the status of an alert [WITH associated incident] | Alert status changes. Alert gets a system note. | Alert status changes. Alert gets a system note. Incident status changes. Incident gets a system note which references the alert. |
| Changing the status of an alert [WITHOUT associated incident] | Alert status changes. Alert gets a system note. | |
| Changing the status of an incident [WITH associated alert] | Incident status changes. Incident gets a system note. | Incident status changes. Incident gets a system note. Alert status changes. Alert gets a system note which references the incident. |
| Changing the status of an incident [WITHOUT associated alert] | Incident status changes. Incident gets a system note. | |
| Opening an incident from an alert | Incident status is set to Triggered. | Incident status is set to match the alert. Incident escalation policy is set to match the alert. |
| Receiving a recovery alert [WITH associated incident] | If setting enabled, incident is resolved. | |
| Closing an incident [WITH associated alert] | Alert is resolved. | |
| Setting an escalation policy for an incident [WITHOUT associated alert] | Sets the status to Triggered & starts escalations. | |
| Setting an escalation policy for an incident [WITH associated alert] | Policy is not modifiable. Policy value matches the associated alert. | Policy is not modifiable. Policy is blank. |
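
The couplings between alerts and incidents that remain in the expected behavior above (opening an incident from an alert, receiving a recovery alert, closing an incident) can be sketched as follows. All names here are hypothetical illustrations, including the `auto_resolve_incident:` flag standing in for the setting mentioned above and the `open` field; none of this is GitLab's actual service code.

```ruby
# Hypothetical stand-ins for illustration only.
Alert = Struct.new(:status)
Incident = Struct.new(:status, :open, :alert)

# Opening an incident from an alert: status starts at Triggered,
# regardless of the alert's current status.
def create_incident_from(alert)
  Incident.new(:triggered, true, alert)
end

# A recovery alert resolves the alert, and resolves the linked incident
# only when the (assumed) setting is enabled.
def receive_recovery_alert(alert, incident, auto_resolve_incident:)
  alert.status = :resolved
  incident.status = :resolved if incident && auto_resolve_incident
end

# Closing an incident still resolves its associated alert, if any.
def close_incident(incident)
  incident.open = false
  incident.alert.status = :resolved if incident.alert
end

alert = Alert.new(:acknowledged)
incident = create_incident_from(alert)
puts incident.status # prints "triggered", not "acknowledged"

receive_recovery_alert(alert, incident, auto_resolve_incident: false)
puts incident.status # still "triggered" because the setting is off
```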

How to set up and validate locally

  • Prerequisite: a project and a user with at least the Maintainer role
  1. Creating an escalation policy:
    • Go to Monitor > Escalation Policies
    • Select the Add an escalation policy button to create a policy
    • Add a rule to notify a single user & save (user rule isn't necessary, just fastest)
  2. Creating an incident:
    • Create a normal issue with the type Incident.
    • The status & escalation policy fields are in the sidebar.
  3. Creating an incident with an associated alert:
    • Set up an active alert integration. Skip the custom mapping (skipping the mapping isn't necessary, just fastest).
    • Select the Send test alert tab to send a payload like { "title": "Sample alert to test incident/alert statuses" }
    • Go to Monitor > Alerts to find the new alert
    • Select the Create incident button
  4. Sending a recovery alert:
    • Select the Send test alert tab for the alert integration, and send a resolving payload like { "title": "Sample alert to test incident/alert statuses", "end_time": "2022-06-30T03:01:53.772Z" }
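
The two test payloads from the steps above can be generated rather than typed by hand. This is a minimal sketch: the helper names and the `SAMPLE_TITLE` constant are hypothetical, and the output is meant to be pasted into the Send test alert tab.

```ruby
require "json"
require "time" # provides Time#iso8601 on older Ruby versions

SAMPLE_TITLE = "Sample alert to test incident/alert statuses"

# Payload for a new test alert.
def test_alert_payload(title = SAMPLE_TITLE)
  JSON.generate("title" => title)
end

# Resolving payload: the same title plus an end_time timestamp.
def recovery_alert_payload(title = SAMPLE_TITLE, ended_at: Time.now.utc)
  JSON.generate("title" => title, "end_time" => ended_at.iso8601(3))
end

puts test_alert_payload
puts recovery_alert_payload
```

Passing an explicit `ended_at:` makes the recovery payload reproducible, which can help when comparing runs.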

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Sarah Yasonik
