Specify how to action on triage-ops uptime incidents (!60)

Jennifer Li requested to merge jennli-main-patch-07432 into main Feb 24, 2023

I would like to add a section to the triage-ops uptime monitoring runbook specifying:

Why the alert was raised.
Whether I need to acknowledge the alert if it was a false positive.
What the current threshold is for how long we can go without receiving an event. I found JSON that sets this to 300s, but I can't find where we configured this setting.

I find myself asking these questions whenever I see these incidents, so it's worth documenting the answers here.

Why can't we define a weekend policy? I vaguely remember that this was not an option, but I cannot find the corresponding conversation. We could further reduce the false-positive noise in the team channel if we could somehow tweak the policy for weekends.
Should we always open an issue when we see these incidents reported?
Do these incidents auto-close once the alert condition stops? Should we ask the triager to close these incidents in Google Cloud, or is that optional? I was unsure whether I should just close the incident once I had confirmed that today's alert was a false positive.

Edited Feb 24, 2023 by Jennifer Li
