Skip to content

Remove backfilled escalation statuses

Sarah Yasonik requested to merge 356215-sy-fix-missing-incident-status into master

What does this MR do and why?

Context:

Together with !83787 (merged), this MR resolves a bug in the incident escalations feature, which blocks rollout of the feature flag.

Changes in this MR:

  • Decommission previous backfill migration for incident_management_issuable_escalation_statuses table
  • Removes all existing records in incident_management_issuable_escalation_statuses

Expected state after these changes:

  • No existing incidents will have an 'incident_management_issuable_escalation_status` record
  • After the incident_escalations is enabled/removed:
    • All newly created incidents will have a corresponding escalation_status (default values of Triggered/No policy, or matching any associated alert's attributes)
    • Previously existing incidents will have no escalation_status. The UI will display None for status/policy & a record will be created as-needed if the user wants to utilize these fields.

Note: there was a little direction shift for this MR following the feedback in !83159 (comment 904704277). I've kept the original description easy to view at the bottom of the "What does this MR do and why?" section of the MR description.

database details

incident_management_issuable_escalation_statuses has around 85k records on GitLab.com.

Query plans:

The docs set a 5min upper bound for concurrent operations in migrations & 3min upper bound for total migration time. This migration is well within either bounds. While it is usual for data-only migrations to be in a post-migration, I believe the speed and assurances of parity in behavior for self-managed instances with GitLab.com make this an exception.

Comment from db pipeline job: !83159 (comment 893317065)

% bin/rails db:migrate:down VERSION=20220321234317
== 20220321234317 RemoveAllIssuableEscalationStatuses: reverting ==============
== 20220321234317 RemoveAllIssuableEscalationStatuses: reverted (0.0000s) =====

% bin/rails db:migrate                            
== 20220321234317 RemoveAllIssuableEscalationStatuses: migrating ==============
-- execute("DELETE FROM incident_management_issuable_escalation_statuses")
   -> 0.0044s
== 20220321234317 RemoveAllIssuableEscalationStatuses: migrated (0.0045s) =====
Original MR description/migration details

Context:

Together with !83787 (merged) & !84139 (merged), this MR resolves a bug in the incident escalations feature, which blocks rollout of the feature flag.

There are some Issue.incident records which don't have a corresponding IncidentManagement::IssuableEscalationStatus record. Similarly, some of the existing (from a previous backfill) IssuableEscalationStatus that corresponding to incidents associated with alerts don't have the correct values.

Changes in this MR:

  • Decommission previous backfill migration for incident_management_issuable_escalation_statuses table.
  • Add new backfill migration which:
    • Deletes all existing records
    • Creates new escalation_status records for all incident issues associated with an AlertManagement::Alert using the alert's attributes (representing only a subset of the previous backfill)

Expected state after these changes:

  • All incidents corresponding to alerts will have an associated escalation_status
  • After the incident_escalations is enabled:
    • All newly created incidents will have a corresponding escalation_status (default values of Triggered/No policy)
    • Previously existing incidents will have no escalation_status. The UI will display None for status/policy & a record will be created as-needed if the user wants to utilize these fields.

database details

Up
== 20220321234317 TruncateIssuableEscalationStatuses: migrating ===============
-- execute("TRUNCATE TABLE incident_management_issuable_escalation_statuses")
   -> 0.0078s
== 20220321234317 TruncateIssuableEscalationStatuses: migrated (0.0078s) ======

== 20220405012923 RedoBackfillIncidentIssueEscalationStatuses: migrating ======
-- Scheduled 1 BackfillIncidentIssueEscalationStatuses jobs with a maximum of 50000 records per batch and an interval of 120 seconds.

The migration is expected to take at least 120 seconds. Expect all jobs to have completed after 2022-04-05 19:31:35 UTC."
== 20220405012923 RedoBackfillIncidentIssueEscalationStatuses: migrated (0.0791s) 
Down
== 20220321234317 TruncateIssuableEscalationStatuses: reverting ===============
== 20220321234317 TruncateIssuableEscalationStatuses: reverted (0.0000s) ======

== 20220405012923 RedoBackfillIncidentIssueEscalationStatuses: reverting ======
== 20220405012923 RedoBackfillIncidentIssueEscalationStatuses: reverted (0.0000s) 

Truncation is pretty speedy, but you can't use explain with TRUNCATE. Screen_Shot_2022-03-29_at_12.50.18_PM

Background migration jobs

Queries for INSERT
SELECT "alert_management_alerts"."status", "alert_management_alerts"."issue_id", "alert_management_alerts"."created_at", "alert_management_alerts"."project_id" 
FROM "alert_management_alerts" 
INNER JOIN "issues" ON "issues"."id" = "alert_management_alerts"."issue_id" 
WHERE "alert_management_alerts"."id" BETWEEN 1 AND 264 
AND ("issues"."issue_type" = 1) 
AND "alert_management_alerts"."id" >= 170

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9466/commands/33599

SELECT DISTINCT ON (project_id) project_id, "incident_management_escalation_policies"."id" 
FROM "incident_management_escalation_policies" 
WHERE "incident_management_escalation_policies"."project_id" IN (44, 43)

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9466/commands/33600

INSERT INTO "incident_management_issuable_escalation_statuses" ("created_at","updated_at","status","issue_id","policy_id","escalations_started_at") 
VALUES 
  ('2022-03-30 21:50:05.083051', '2022-03-30 21:50:05.083051', 1, 255, 43, '2022-03-30 21:50:04'), 
  ('2022-03-30 21:50:05.083138', '2022-03-30 21:50:05.083138', 2, 257, 43, '2022-03-30 21:50:04'), 
  ('2022-03-30 21:50:05.083162', '2022-03-30 21:50:05.083162', 1, 261, NULL, NULL), 
  ('2022-03-30 21:50:05.083174', '2022-03-30 21:50:05.083174', 2, 263, NULL, NULL) 
ON CONFLICT  DO NOTHING 
RETURNING "id"

alert_management_alerts has around 9.1M records on SaaS. A subset of those have associated issues, most of which are incidents.

In batches of 30k alerts, testing insertion with a single query on postgres.ai showed ~2-24 writes per batch, consistent with what we might expect. Run time of ~500ms to 2s per query. This will be longer than the current version, having broken up queries and batching.
WITH alerts AS (
  SELECT
    issue_id,
    status,
    project_id,
    created_at
  FROM alert_management_alerts
  WHERE id BETWEEN $1 AND $2
  AND issue_id IS NOT NULL
), policies AS (
  SELECT
    DISTINCT ON (project_id) project_id,
    id
  FROM incident_management_escalation_policies
  WHERE project_id IN (SELECT project_id FROM alerts)
  ORDER BY project_id
)
INSERT INTO incident_management_issuable_escalation_statuses (created_at, updated_at, issue_id, status, policy_id, escalations_started_at)
  SELECT
    current_timestamp AS created_at,
    current_timestamp AS updated_at,
    alerts.issue_id AS issue_id,
    alerts.status AS status,
    policies.id AS policy_id,
    CASE WHEN policies.id IS NOT NULL THEN alerts.created_at END AS escalations_started_at
  FROM alerts
  INNER JOIN issues ON alerts.issue_id = issues.id
  LEFT JOIN policies ON alerts.project_id = policies.project_id
  WHERE issues.issue_type = 1
ON CONFLICT (issue_id) DO NOTHING;

In reality, it would probably be fine to execute a single insertion per background job, since the number of records we expect to save isn't terribly large. We should account for an edge cases, however: If a self-managed instance has a large number of alerts on a project which automatically creates incidents for each alert, that could result in a lot of records for a single 30k batch. This is unlikely, but a possibility, regardless. Calling each_batch from within the background job protects against that.

Queries from .each_batch
SELECT "alert_management_alerts"."id" 
FROM "alert_management_alerts" 
INNER JOIN "issues" ON "issues"."id" = "alert_management_alerts"."issue_id" 
WHERE "alert_management_alerts"."id" BETWEEN 1 AND 264 
AND ("issues"."issue_type" = 1) 
ORDER BY "alert_management_alerts"."id" ASC 
LIMIT 1
SELECT "alert_management_alerts"."id" 
FROM "alert_management_alerts" 
INNER JOIN "issues" ON "issues"."id" = "alert_management_alerts"."issue_id" 
WHERE "alert_management_alerts"."id" BETWEEN 1 AND 264 
AND ("issues"."issue_type" = 1) AND "alert_management_alerts"."id" >= 170 
ORDER BY "alert_management_alerts"."id" ASC 
LIMIT 1 
OFFSET 10

How to set up and validate locally

!83787 (merged) goes through details of testing.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #356215 (closed)

Edited by Sarah Yasonik

Merge request reports