
Add Pending Alert Escalations table, model, services and worker

Sean Arnold requested to merge 323139-create-alert-escalations into master

What does this MR do?

Note: This is behind feature flag escalation_policies_mvc, and licensed flag escalation_policies.

DB Migration

This adds the Pending Alert Escalations table (incident_management_pending_alert_escalations), as part of #323139 (closed).

Table: incident_management_pending_alert_escalations

Column       Type                       Null
id           bigint                     not null
rule_id      bigint                     null
alert_id     bigint                     not null
schedule_id  bigint                     not null
status       smallint                   not null
process_at   timestamp with time zone   not null
created_at   timestamp with time zone   not null
updated_at   timestamp with time zone   not null

Database commands:

Up
== 20210617022324 CreateIncidentManagementPendingAlertEscalations: migrating ==

CREATE TABLE incident_management_pending_alert_escalations (
  id bigserial NOT NULL,
  rule_id bigint,
  alert_id bigint NOT NULL,
  schedule_id bigint NOT NULL,
  process_at timestamp with time zone NOT NULL,
  created_at timestamp with time zone NOT NULL,
  updated_at timestamp with time zone NOT NULL,
  status smallint NOT NULL,
  PRIMARY KEY (id, process_at)
) PARTITION BY RANGE (process_at);
CREATE INDEX index_incident_management_pending_alert_escalations_on_alert_id
  ON incident_management_pending_alert_escalations USING btree (alert_id);

CREATE INDEX index_incident_management_pending_alert_escalations_on_rule_id
  ON incident_management_pending_alert_escalations USING btree (rule_id);

CREATE INDEX index_incident_management_pending_alert_escalations_on_schedule_id
  ON incident_management_pending_alert_escalations USING btree (schedule_id);

CREATE INDEX index_incident_management_pending_alert_escalations_on_process_at
  ON incident_management_pending_alert_escalations USING btree (process_at);

ALTER TABLE incident_management_pending_alert_escalations ADD CONSTRAINT fk_rails_fcbfd9338b
  FOREIGN KEY (schedule_id) REFERENCES incident_management_oncall_schedules(id) ON DELETE CASCADE;

ALTER TABLE incident_management_pending_alert_escalations ADD CONSTRAINT fk_rails_057c1e3d87
  FOREIGN KEY (rule_id) REFERENCES incident_management_escalation_rules(id) ON DELETE SET NULL;

ALTER TABLE incident_management_pending_alert_escalations ADD CONSTRAINT fk_rails_8d8de95da9
  FOREIGN KEY (alert_id) REFERENCES alert_management_alerts(id) ON DELETE CASCADE;
Down
== 20210617022324 CreateIncidentManagementPendingAlertEscalations: reverting ==
-- drop_table(:incident_management_pending_alert_escalations)
   -> 0.0145s
== 20210617022324 CreateIncidentManagementPendingAlertEscalations: reverted (0.0216s)

Creation of Pending Alert Escalations

We create an escalation for every incoming alert whose project has an escalation policy (with rules) set up. This is guarded by the feature flag.

The logic for creating the escalations is held in IncidentManagement::PendingEscalations::CreateService, which takes a target (an AlertManagement::Alert, and in the future, an Incident issue).
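As a rough illustration of that flow, here is a minimal plain-Ruby sketch, not the service from this MR. The structs, the `elapsed_time_seconds` field, and the `feature_enabled:` argument are all hypothetical stand-ins; only the output fields (rule_id, alert_id, schedule_id, status, process_at) come from the table above. It assumes one pending escalation is built per rule.

```ruby
# Hypothetical stand-ins for the real models (names are illustrative only).
SketchRule = Struct.new(:id, :schedule_id, :status, :elapsed_time_seconds, keyword_init: true)
SketchAlert = Struct.new(:id, :status, :escalation_rules, keyword_init: true)
SketchPendingEscalation = Struct.new(:rule_id, :alert_id, :schedule_id, :status, :process_at,
                                     keyword_init: true)

class CreatePendingEscalationsSketch
  # target: an alert-like object; feature_enabled stands in for the feature flag check.
  def initialize(target, feature_enabled: true, now: Time.now)
    @target = target
    @feature_enabled = feature_enabled
    @now = now
  end

  # Returns one pending escalation per rule, or nothing when the flag is off
  # or the project has no rules.
  def execute
    return [] unless @feature_enabled

    @target.escalation_rules.map do |rule|
      SketchPendingEscalation.new(
        rule_id: rule.id,
        alert_id: @target.id,
        schedule_id: rule.schedule_id,
        status: rule.status,
        # Assumption: the rule carries a delay that determines process_at.
        process_at: @now + rule.elapsed_time_seconds
      )
    end
  end
end
```

Passing `now:` explicitly keeps the sketch deterministic; the real service may well derive process_at differently.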

Deleting / Creating Escalations on status changes

We create or delete escalations as a result of an Alert status change:

Alert status change                                Result
triggered/acknowledged -> resolved/ignored         Delete existing Alert Escalations for the alert
resolved/ignored -> triggered/acknowledged         Create a new Alert Escalation for the alert
resolved/ignored -> resolved/ignored               No change
triggered/acknowledged -> triggered/acknowledged   No change
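The transitions listed here can be expressed as a small decision function. This is a sketch of the rule only, in plain Ruby; the constant names, the method name, and the returned symbols are hypothetical and do not appear in the MR.

```ruby
# Statuses for which an alert can still be escalated, and those for which it cannot.
ACTIVE_STATUSES = %i[triggered acknowledged].freeze
INACTIVE_STATUSES = %i[resolved ignored].freeze

# Returns :delete, :create, or :none for a given alert status transition.
def escalation_action(old_status, new_status)
  [old_status, new_status].each do |status|
    unless ACTIVE_STATUSES.include?(status) || INACTIVE_STATUSES.include?(status)
      raise ArgumentError, "unknown status: #{status.inspect}"
    end
  end

  was_active = ACTIVE_STATUSES.include?(old_status)
  now_active = ACTIVE_STATUSES.include?(new_status)

  if was_active && !now_active
    :delete # alert was closed out: drop its pending escalations
  elsif !was_active && now_active
    :create # alert was reopened: start escalating again
  else
    :none   # stayed on the same side: nothing to do
  end
end
```

For example, `escalation_action(:triggered, :resolved)` returns `:delete`, while `escalation_action(:triggered, :acknowledged)` returns `:none`.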

IncidentManagement::PendingEscalations::ProcessService

This evaluates the rule information stored on each pending escalation. If the criteria are met (the required status is not set on the alert, and enough time has passed that process_at is now in the past), we notify the on-call schedule.
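The two criteria can be sketched as a single predicate. This is an illustrative plain-Ruby version under simplifying assumptions, not the service itself: `DueCheckRecord`, `required_status`, and `ready_to_notify?` are hypothetical names, and "the required status is not set" is modelled as a simple inequality, which may be looser or stricter than the real comparison.

```ruby
# Hypothetical record holding the two fields the check needs.
DueCheckRecord = Struct.new(:required_status, :process_at, keyword_init: true)

# True when the alert has not reached the rule's required status
# and the escalation's process_at has already passed.
def ready_to_notify?(escalation, alert_status, now: Time.now)
  status_not_set = alert_status != escalation.required_status
  # Assumption: a process_at equal to `now` counts as due.
  time_elapsed = escalation.process_at <= now

  status_not_set && time_elapsed
end
```

With this sketch, an escalation whose rule requires `:acknowledged` triggers a notification only if the alert is still in another status once process_at has passed.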

Workers

To run the service mentioned above, we have a Cron worker and a job worker.

The cron worker, IncidentManagement::Escalations::ScheduleEscalationCheckCronWorker, iterates over the pending escalations that are ready to process, and spawns an IncidentManagement::Escalations::PendingAlertEscalationCheckWorker job for each.

It does this in batches of 1000 using bulk_perform_async.
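The batching can be pictured with the following plain-Ruby sketch. `FakeCheckWorker` is a hypothetical stand-in that only records what it receives, and it assumes `bulk_perform_async` takes a list of per-job argument arrays; the real cron worker would query due escalations from the database rather than receive an array of IDs.

```ruby
# Hypothetical stand-in for the job worker: records each bulk call instead of enqueuing.
class FakeCheckWorker
  @batches = []

  class << self
    attr_reader :batches

    def reset!
      @batches = []
    end

    # Assumed shape: a list of argument arrays, one per job.
    def bulk_perform_async(args_list)
      @batches << args_list
    end
  end
end

SKETCH_BATCH_SIZE = 1000

# Enqueues one job per due escalation ID, at most batch_size jobs per bulk call.
def schedule_escalation_checks(due_escalation_ids, worker: FakeCheckWorker,
                               batch_size: SKETCH_BATCH_SIZE)
  due_escalation_ids.each_slice(batch_size) do |ids|
    worker.bulk_perform_async(ids.map { |id| [id] })
  end
end
```

With 2,500 due escalations this makes three bulk calls (1000, 1000, and 500 jobs), keeping each enqueue request bounded in size.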

Related to #323139 (closed)

Edited by Sean Arnold
