Monitor Alertmanager metrics for failing alerts
We had a situation where the Slack API tokens used by Alertmanager had stopped working (for reasons that are still unclear) and Alertmanager was getting 404 errors on every webhook request to Slack.
Alertmanager exports a metric for notification failures, so it would be straightforward to add an alert on it:
rate(alertmanager_notifications_failed_total[1m]) > 0
Right now it would be sufficient to make sure this alert is labelled pager: pagerduty, as that routes to both PagerDuty and Slack. As long as both channels don't fail at the same time, we would find out about the problem. A sketch of what the rule could look like is below.
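A minimal sketch of such a rule, assuming a standard Prometheus rule file; the group name, alert name, `for` duration, and annotations are illustrative, and the pager: pagerduty label follows the routing convention described above.

```yaml
# Hypothetical rule file; names and thresholds are illustrative, not our real config.
groups:
  - name: alertmanager-self-monitoring
    rules:
      - alert: AlertmanagerNotificationsFailing
        # Fires when any notification integration reports delivery failures.
        expr: rate(alertmanager_notifications_failed_total[1m]) > 0
        for: 5m
        labels:
          pager: pagerduty  # routes to both PagerDuty and Slack
        annotations:
          summary: "Alertmanager notifications are failing"
          description: "Alertmanager is failing to deliver notifications (e.g. Slack webhook errors). Check integration tokens and endpoints."
```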
Eventually it would be good to add a third, more low-tech and reliable communication channel for this alert, e.g. email sent to a hardcoded list of addresses, possibly even including non-GitLab email addresses.
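If that email fallback were implemented through Alertmanager itself, a route and receiver along these lines could work; the receiver name, addresses, SMTP settings, and matched alert name here are placeholders, not our actual configuration.

```yaml
# Hypothetical Alertmanager fragment; all values below are placeholders.
route:
  routes:
    - match:
        alertname: AlertmanagerNotificationsFailing
      receiver: fallback-email
receivers:
  - name: fallback-email
    email_configs:
      - to: 'oncall-fallback@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'REDACTED'
```

Note that this still depends on Alertmanager being able to send email; a channel fully outside Alertmanager would be even more robust.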