Specify how to act on triage-ops uptime incidents
What this MR does
I would like to add a section to the triage-ops uptime monitoring runbook specifying:
- Why this alert was raised
- Whether the alert needs to be acknowledged if it was a false positive
- What the current threshold is for how long we can go without receiving an event. I found JSON saying this is set to 300s, but I can't find where we configured this setting.
I find myself asking these questions whenever I see these incidents, so it's worth documenting the answers here.
Other thoughts that were not addressed
- Why can't we define a weekend policy? I vaguely remember that this was not an option, but I cannot find the corresponding conversation. The false-positive noise in the team channel could be reduced further if we could tweak the policy for weekends.
- Should we always open an issue when these incidents are reported?
- Do these incidents auto-close once the alert condition stops? Should we ask the triager to close these incidents in Google Cloud, or is that optional? I was unsure whether I should just close the incident after confirming that today's alert was a false positive.
Edited by Jennifer Li