Trigger-happy alerts are contributing to alert fatigue
Currently there are thirteen alerts in our Alertmanager configuration that fire after their alert condition has been met for less than one minute. For this issue, I'm going to dub them the trigger-happy alerts.
Any alert that fires in under a minute should be an exception, reserved for the most critical circumstances.
This is not the case at present, and it's leading to pager fatigue.
Example
(This section is justification for why 10s and 15s durations are unnecessary. Skip if you don't need justification.)
The `IncreasedServerResponseErrors` alert will fire after the alert condition has been met for 10 seconds.
```yaml
- alert: IncreasedServerResponseErrors
  expr: rate(haproxy_server_response_errors_total[1m]) > .5
  for: 10s
  labels:
    pager: pagerduty
    severity: critical
  annotations:
    description: We are seeing an increase in server response errors on {{$labels.fqdn}} for backend/server {{$labels.backend}}/{{$labels.server}}.
      This likely indicates that requests are being sent to servers and there are errors reported to users.
    runbook: troubleshooting/haproxy.md
    title: Increased Server Response Errors
```
The first problem with the alert condition `rate(haproxy_server_response_errors_total[1m]) > .5` is that it's not aggregated. For the given labels on the metric `haproxy_server_response_errors_total`, there are 2334 (!!) different combinations (the main dimensions being HAProxy node + backend type + backend server), and each of these combinations is treated as an individual alert. Let's discuss that in another issue, though.
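As a sketch only, an aggregated form of the expression could look something like the following; which labels to keep (here just `backend`) and the threshold are illustrative assumptions, not a proposal:

```
# one series per backend instead of 2334 node/backend/server combinations
sum by (backend) (rate(haproxy_server_response_errors_total[1m])) > 0.5
```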
At this moment, HAProxy is handling about 3676 HTTP requests per second, or 224284 requests per minute (source: `sum(rate(haproxy_http_response_duration_seconds_count[1m]))`).
Now, since we're not aggregating and each frontend-to-backend-server combination is treated as a separate alert, the busiest of these combinations comes in at 5406 requests per minute (source: `sort(rate(haproxy_http_response_duration_seconds_count[1m])) * 60`).
For us to hit the required error rate of 0.5 errors per second, at which we will send a critical PagerDuty alert, possibly waking someone up, we need to see 30 errors in a 60-second period (the `1m` range).
So 30 errors in one minute, out of 5406 requests, gives us an error rate of roughly 0.55%.
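A quick way to see what the static `0.5` threshold means in relative terms is to divide it by each series' request rate. This is just a sketch, assuming `haproxy_http_response_duration_seconds_count` counts all requests for the same label combinations:

```
# fraction of each series' traffic that 0.5 errors/second represents
0.5 / rate(haproxy_http_response_duration_seconds_count[1m])
```

For the busiest series above (5406 requests per minute, roughly 90 per second) this evaluates to about 0.0055, i.e. the ~0.55% figure.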
It's also worth pointing out that since the number of requests continues to rise, the required error rate needed for an alert will continue to decrease.
At our current volumes, 30 errors in a minute is just background noise.
This explains why nobody is taking action on these alerts, or on the other trigger-happy alerts.
Which alerts fire in less than 1 minute?
These trigger-happy alerts are:
- `MonitorGitlabNetPrometheusDown`
- `MonitorGitlabNetNotAccessible`
- `HighWebErrorRate`
- `IncreasedErrorRateHTTPSGit`
- `IncreasedErrorRateOtherBackends`
- `IncreasedBackendConnectionErrors`
- `IncreasedServerResponseErrors`
- `IncreasedServerConnectionErrors`
- `HighRailsErrorRateWarning`
- `BlackBoxGitPullHttps`
- `BlackBoxGitPullSsh`
- `BlackBoxGitPushHttps`
- `BlackBoxGitPushSsh`
Now, let's query Prometheus to find out what the most persistent PagerDuty alerts triggered from Alertmanager have been for the past two weeks:

```
sort(sum((sum_over_time(ALERTS{alertstate="firing", severity="critical", environment="gprd"}[2w]))) by (alertname, environment))
```
So, about 74% of our alert noise over the past two weeks comes from these trigger-happy alerts.
This is not a coincidence.
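For what it's worth, here is a sketch of how that split could be computed from the same `ALERTS` series; the long `alertname` regex simply enumerates the thirteen alerts listed above, and it assumes firing series-minutes is an acceptable proxy for "alert noise":

```
# share of critical firing time in gprd attributable to the trigger-happy alerts
  sum(sum_over_time(ALERTS{alertstate="firing", severity="critical", environment="gprd", alertname=~"MonitorGitlabNetPrometheusDown|MonitorGitlabNetNotAccessible|HighWebErrorRate|IncreasedErrorRateHTTPSGit|IncreasedErrorRateOtherBackends|IncreasedBackendConnectionErrors|IncreasedServerResponseErrors|IncreasedServerConnectionErrors|HighRailsErrorRateWarning|BlackBoxGitPullHttps|BlackBoxGitPullSsh|BlackBoxGitPushHttps|BlackBoxGitPushSsh"}[2w]))
/
  sum(sum_over_time(ALERTS{alertstate="firing", severity="critical", environment="gprd"}[2w]))
```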
Why we shouldn't have 15-second alerts
A PagerDuty alert should be raised in the exceptional situation in which our infrastructure has detected an error that cannot be resolved automatically and requires human intervention. By raising a PagerDuty alert after 15 seconds, we're effectively saying that under the error condition, our infrastructure cannot go 15 seconds without operator intervention.
We should be building systems to be resilient. Our goal should be that any error condition, bar an asteroid hitting the datacentre, can be managed without operator intervention for at least a few minutes.
What should we do with these alerts?
- I favour anomaly detection via the general alerts, but these need to be plumbed into PagerDuty before we can move forward.
- In the interim, at the very least we should review these thirteen alerts and consider extending the pending period to at least 1 minute.
- We should also use error rates, not static thresholds like `0.5`, in our alerts (see the sketch after this list).
- We should conduct a review of all alerts that don't aggregate their expressions and consider whether they are alerting on what we intend them to alert on.
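To make the interim suggestions concrete, here is a hypothetical revision of the `IncreasedServerResponseErrors` rule combining a longer pending period with an aggregated, ratio-based expression. The 1% threshold, the `backend` aggregation, and the use of `haproxy_http_response_duration_seconds_count` as the request-volume denominator are illustrative assumptions that would need tuning and label-matching checks, not a finished proposal:

```yaml
# hypothetical sketch, not a tuned proposal
- alert: IncreasedServerResponseErrors
  expr: |
    sum by (backend) (rate(haproxy_server_response_errors_total[1m]))
    /
    sum by (backend) (rate(haproxy_http_response_duration_seconds_count[1m]))
    > 0.01
  for: 1m
  labels:
    pager: pagerduty
    severity: critical
  annotations:
    description: More than 1% of requests to backend {{$labels.backend}} are returning server response errors.
    runbook: troubleshooting/haproxy.md
    title: Increased Server Response Errors
```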