Specify how to act on triage-ops uptime incidents

Jennifer Li requested to merge jennli-main-patch-07432 into main

What this MR does

I would like to add a section to the triage-ops uptime monitoring runbook specifying:

  • Why this alert was raised
  • Whether the alert needs to be acknowledged if it was a false positive
  • What the current time threshold is for not receiving an event. I found the JSON saying this is set to 300s, but I can't find where we configured this setting.

I find myself asking these questions whenever I see these incidents, so it's worth documenting the answers here.

Other thoughts that were not addressed

  • Why can't we define a weekend policy? I vaguely remember that this was not an option, but I cannot find the corresponding conversation. The false-positive noise in the team channel could be reduced further if we could somehow tweak the policy for weekends.
  • Should we always open an issue when these incidents are reported?
  • Do these incidents auto-close when stopped? Should we ask the triager to close these incidents in Google Cloud, or is that optional? I was unsure whether I should just close the incident once I had confirmed that today's alert was a false positive.
Edited by Jennifer Li
