Remove backfilled escalation statuses
What does this MR do and why?
- Related issue: #356215 (closed)
- Blocking MRs: !83787 (merged), !84139 (merged), !84633 (merged)
⚠ This MR should merge only after the blocking MRs.
Context:
Together with !83787 (merged), this MR resolves a bug in the incident escalations feature, which blocks rollout of the feature flag.
Changes in this MR:
- Decommission previous backfill migration for
incident_management_issuable_escalation_statuses
table - Removes all existing records in
incident_management_issuable_escalation_statuses
Expected state after these changes:
- No existing incidents will have an 'incident_management_issuable_escalation_status` record
- After the
incident_escalations
is enabled/removed:- All newly created incidents will have a corresponding
escalation_status
(default values of Triggered/No policy, or matching any associated alert's attributes) - Previously existing incidents will have no
escalation_status
. The UI will displayNone
for status/policy & a record will be created as-needed if the user wants to utilize these fields.
- All newly created incidents will have a corresponding
Note: there was a little direction shift for this MR following the feedback in !83159 (comment 904704277). I've kept the original description easy to view at the bottom of the "What does this MR do and why?" section of the MR description.
database details
incident_management_issuable_escalation_statuses
has around 85k records on GitLab.com.
Query plans:
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9645/commands/34197
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9645/commands/34199
- Alt option: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9641/commands/34187 (single statement; 1.6s total migration time from pipeline job)
The docs set a 5min upper bound for concurrent operations in migrations & 3min upper bound for total migration time. This migration is well within either bounds. While it is usual for data-only migrations to be in a post-migration, I believe the speed and assurances of parity in behavior for self-managed instances with GitLab.com make this an exception.
Comment from db pipeline job: !83159 (comment 893317065)
% bin/rails db:migrate:down VERSION=20220321234317
== 20220321234317 RemoveAllIssuableEscalationStatuses: reverting ==============
== 20220321234317 RemoveAllIssuableEscalationStatuses: reverted (0.0000s) =====
% bin/rails db:migrate
== 20220321234317 RemoveAllIssuableEscalationStatuses: migrating ==============
-- execute("DELETE FROM incident_management_issuable_escalation_statuses")
-> 0.0044s
== 20220321234317 RemoveAllIssuableEscalationStatuses: migrated (0.0045s) =====
Original MR description/migration details
Context:
Together with !83787 (merged) & !84139 (merged), this MR resolves a bug in the incident escalations feature, which blocks rollout of the feature flag.
There are some Issue.incident
records which don't have a corresponding IncidentManagement::IssuableEscalationStatus
record. Similarly, some of the existing (from a previous backfill) IssuableEscalationStatus
that corresponding to incidents associated with alerts don't have the correct values.
Changes in this MR:
- Decommission previous backfill migration for
incident_management_issuable_escalation_statuses
table. - Add new backfill migration which:
- Deletes all existing records
- Creates new
escalation_status
records for all incident issues associated with anAlertManagement::Alert
using the alert's attributes (representing only a subset of the previous backfill)
Expected state after these changes:
- All incidents corresponding to alerts will have an associated
escalation_status
- After the
incident_escalations
is enabled:- All newly created incidents will have a corresponding
escalation_status
(default values of Triggered/No policy) - Previously existing incidents will have no
escalation_status
. The UI will displayNone
for status/policy & a record will be created as-needed if the user wants to utilize these fields.
- All newly created incidents will have a corresponding
database details
Up
== 20220321234317 TruncateIssuableEscalationStatuses: migrating ===============
-- execute("TRUNCATE TABLE incident_management_issuable_escalation_statuses")
-> 0.0078s
== 20220321234317 TruncateIssuableEscalationStatuses: migrated (0.0078s) ======
== 20220405012923 RedoBackfillIncidentIssueEscalationStatuses: migrating ======
-- Scheduled 1 BackfillIncidentIssueEscalationStatuses jobs with a maximum of 50000 records per batch and an interval of 120 seconds.
The migration is expected to take at least 120 seconds. Expect all jobs to have completed after 2022-04-05 19:31:35 UTC."
== 20220405012923 RedoBackfillIncidentIssueEscalationStatuses: migrated (0.0791s)
Down
== 20220321234317 TruncateIssuableEscalationStatuses: reverting ===============
== 20220321234317 TruncateIssuableEscalationStatuses: reverted (0.0000s) ======
== 20220405012923 RedoBackfillIncidentIssueEscalationStatuses: reverting ======
== 20220405012923 RedoBackfillIncidentIssueEscalationStatuses: reverted (0.0000s)
Truncation is pretty speedy, but you can't use explain
with TRUNCATE
.
Background migration jobs
INSERT
Queries for SELECT "alert_management_alerts"."status", "alert_management_alerts"."issue_id", "alert_management_alerts"."created_at", "alert_management_alerts"."project_id"
FROM "alert_management_alerts"
INNER JOIN "issues" ON "issues"."id" = "alert_management_alerts"."issue_id"
WHERE "alert_management_alerts"."id" BETWEEN 1 AND 264
AND ("issues"."issue_type" = 1)
AND "alert_management_alerts"."id" >= 170
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9466/commands/33599
SELECT DISTINCT ON (project_id) project_id, "incident_management_escalation_policies"."id"
FROM "incident_management_escalation_policies"
WHERE "incident_management_escalation_policies"."project_id" IN (44, 43)
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9466/commands/33600
INSERT INTO "incident_management_issuable_escalation_statuses" ("created_at","updated_at","status","issue_id","policy_id","escalations_started_at")
VALUES
('2022-03-30 21:50:05.083051', '2022-03-30 21:50:05.083051', 1, 255, 43, '2022-03-30 21:50:04'),
('2022-03-30 21:50:05.083138', '2022-03-30 21:50:05.083138', 2, 257, 43, '2022-03-30 21:50:04'),
('2022-03-30 21:50:05.083162', '2022-03-30 21:50:05.083162', 1, 261, NULL, NULL),
('2022-03-30 21:50:05.083174', '2022-03-30 21:50:05.083174', 2, 263, NULL, NULL)
ON CONFLICT DO NOTHING
RETURNING "id"
alert_management_alerts
has around 9.1M records on SaaS. A subset of those have associated issues, most of which are incidents.
In batches of 30k alerts, testing insertion with a single query on postgres.ai showed ~2-24 writes per batch, consistent with what we might expect. Run time of ~500ms to 2s per query. This will be longer than the current version, having broken up queries and batching.
WITH alerts AS (
SELECT
issue_id,
status,
project_id,
created_at
FROM alert_management_alerts
WHERE id BETWEEN $1 AND $2
AND issue_id IS NOT NULL
), policies AS (
SELECT
DISTINCT ON (project_id) project_id,
id
FROM incident_management_escalation_policies
WHERE project_id IN (SELECT project_id FROM alerts)
ORDER BY project_id
)
INSERT INTO incident_management_issuable_escalation_statuses (created_at, updated_at, issue_id, status, policy_id, escalations_started_at)
SELECT
current_timestamp AS created_at,
current_timestamp AS updated_at,
alerts.issue_id AS issue_id,
alerts.status AS status,
policies.id AS policy_id,
CASE WHEN policies.id IS NOT NULL THEN alerts.created_at END AS escalations_started_at
FROM alerts
INNER JOIN issues ON alerts.issue_id = issues.id
LEFT JOIN policies ON alerts.project_id = policies.project_id
WHERE issues.issue_type = 1
ON CONFLICT (issue_id) DO NOTHING;
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9443/commands/33534 (fastest)
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9443/commands/33519 (slowest)
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9443/commands/33520
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9443/commands/33522
- https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/9443/commands/33536
In reality, it would probably be fine to execute a single insertion per background job, since the number of records we expect to save isn't terribly large. We should account for an edge cases, however: If a self-managed instance has a large number of alerts on a project which automatically creates incidents for each alert, that could result in a lot of records for a single 30k batch. This is unlikely, but a possibility, regardless. Calling each_batch
from within the background job protects against that.
.each_batch
Queries from SELECT "alert_management_alerts"."id"
FROM "alert_management_alerts"
INNER JOIN "issues" ON "issues"."id" = "alert_management_alerts"."issue_id"
WHERE "alert_management_alerts"."id" BETWEEN 1 AND 264
AND ("issues"."issue_type" = 1)
ORDER BY "alert_management_alerts"."id" ASC
LIMIT 1
SELECT "alert_management_alerts"."id"
FROM "alert_management_alerts"
INNER JOIN "issues" ON "issues"."id" = "alert_management_alerts"."issue_id"
WHERE "alert_management_alerts"."id" BETWEEN 1 AND 264
AND ("issues"."issue_type" = 1) AND "alert_management_alerts"."id" >= 170
ORDER BY "alert_management_alerts"."id" ASC
LIMIT 1
OFFSET 10
How to set up and validate locally
!83787 (merged) goes through details of testing.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #356215 (closed)