Monitor Alertmanager metrics for failing alerts

We had a situation where the Slack API tokens for Alertmanager had stopped working (for reasons that remain unclear), and Alertmanager was getting 404 errors on every webhook request to Slack.

Alertmanager exports a metric for failed notifications, and it would be straightforward to add an alert on it:

```
rate(alertmanager_notifications_failed_total[1m]) > 0
```

Right now it would be sufficient to make sure this alert is labelled `pager: pagerduty`, since that routes to both PagerDuty and Slack. As long as both channels don't start failing at the same time, we would find out about the breakage.
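As a sketch, the expression above could be wired into a Prometheus rule file like the following. The group name, alert name, `for` duration, and annotation text are all placeholders, not settled choices; only the `expr` and the `pager: pagerduty` label come from this issue:

```yaml
groups:
  - name: alertmanager-self-monitoring  # placeholder group name
    rules:
      - alert: AlertmanagerNotificationsFailing  # placeholder alert name
        # Fires when any notification integration is failing.
        expr: rate(alertmanager_notifications_failed_total[1m]) > 0
        for: 5m  # arbitrary grace period, tune as needed
        labels:
          pager: pagerduty  # routes to both PagerDuty and Slack
        annotations:
          summary: "Alertmanager is failing to deliver notifications"
          description: "Notification failures for integration {{ $labels.integration }}."
```

Note that `alertmanager_notifications_failed_total` is labelled by `integration`, so the alert can tell us which channel (e.g. `slack`, `webhook`) is broken.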

Eventually it would be good to add a third, lower-tech and more reliable communication channel for this alert, e.g. email sent to a hardcoded list of addresses, possibly even including non-GitLab addresses.
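For the email fallback, Alertmanager's own `email_configs` receiver would work; a minimal sketch, where every address, hostname, and credential below is a placeholder:

```yaml
receivers:
  - name: email-fallback  # placeholder receiver name
    email_configs:
      - to: oncall@example.com  # hardcoded recipient list, placeholder
        from: alertmanager@example.com
        smarthost: smtp.example.com:587  # placeholder SMTP relay
        auth_username: alertmanager@example.com
        auth_password: CHANGEME  # store as a secret, not in plain config
```

The catch is that this receiver lives in the same Alertmanager whose delivery we are worried about, so a truly independent channel would need to come from a separate system.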
