Attempts to make paging more forgiving

John Skarbek requested to merge jts/dont-wake-me into master
  • The goal here is to avoid waking someone up unless more than 33% of servers are suffering from the same problem.
  • The existing alert only took into account ONE server. Let's leave that in place as a trail towards a problem, since on its own it is not enough to warrant waking someone up. So we lower its severity and change where the alert is routed. Removing PagerDuty should leave it in Slack, and making it a warning means we are not required to respond.
  • Let's create a new alert that takes into account the fact that we run multiple servers of the same type. If 33% of the fleet is suffering from this problem, page us; something MUST be wrong.
  • Reduces the sensitivity of the backend_connection alert to lessen its potential toil factor.
  • Closes: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6059
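
The fleet-wide paging rule described above can be sketched in Python to make the intended behaviour concrete. This is only an illustration of the logic, not the PromQL in this merge request: the function names, the server names, the severity labels, and the choice of a strict "greater than" comparison are all assumptions made for the example.

```python
from typing import Mapping

# Fraction of the fleet that must be affected before paging.
# The 33% figure comes from the description; comparing with ">" is an assumption.
PAGE_THRESHOLD = 0.33


def fraction_affected(server_has_problem: Mapping[str, bool]) -> float:
    """Return the share of servers currently reporting the problem."""
    if not server_has_problem:
        return 0.0
    return sum(server_has_problem.values()) / len(server_has_problem)


def severity(server_has_problem: Mapping[str, bool]) -> str:
    """Map the fleet state to a hypothetical severity label.

    "page" wakes someone up, "warn" only leaves a trail (for example in chat),
    and "ok" means no server is affected.
    """
    fraction = fraction_affected(server_has_problem)
    if fraction > PAGE_THRESHOLD:
        return "page"
    if fraction > 0:
        return "warn"
    return "ok"


if __name__ == "__main__":
    # Hypothetical fleet of four servers with a single unhealthy member.
    fleet = {"web-01": True, "web-02": False, "web-03": False, "web-04": False}
    print(fraction_affected(fleet), severity(fleet))
```

With one of four servers affected the sketch stays at the warning level; once two of the four are affected, the fraction crosses the threshold and it pages.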

Questions to be answered prior to merge

  • Can this query be improved?
  • Is this query stressful on the Prometheus server?
  • Does this do what our team wants out of alerting/paging?

/cc @bjk-gitlab for the PromQL-specific questions
/cc @gitlab-com/gl-infra for the alert questions

Edited by John Skarbek