HAProxy alerts should fire based on error-rate rather than static value
From: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6397, we would like to alert based on error-rate rather than per-second average rate of increase of total errors. Alerts in the scope of this work:
- IncreasedBackendConnectionErrors: `rate(haproxy_backend_connection_errors_total[1m]) > .1`. What this means is that 0.1 * 60 = 6 errors per minute are enough to trigger this alert, regardless of the total number of connections (which might also increase over time).
- IncreasedServerConnectionErrors: `rate(haproxy_server_connection_errors_total[1m]) > .1`. Here, too, 6 errors per minute are enough.
- IncreasedServerResponseErrors: `rate(haproxy_server_response_errors_total[1m]) > .5`. Here it means 0.5 * 60 = 30 errors per minute are enough to trigger this alert.
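To make the per-second thresholds above concrete, a quick sketch (in Python, with the thresholds taken from the queries above) of how they translate into absolute errors per minute:

```python
# Current per-second alert thresholds, keyed by HAProxy metric name.
thresholds_per_sec = {
    "haproxy_backend_connection_errors_total": 0.1,
    "haproxy_server_connection_errors_total": 0.1,
    "haproxy_server_response_errors_total": 0.5,
}

# rate(...[1m]) is a per-second average, so multiply by 60 to get
# the number of errors per minute that trips each alert.
errors_per_min = {metric: t * 60 for metric, t in thresholds_per_sec.items()}

for metric, n in errors_per_min.items():
    print(f"{metric}: fires at ~{n:.0f} errors/min")
```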
The proposal here is to calculate the above based on error rate, where we calculate it as: `error count / total count`.
For example, for IncreasedServerResponseErrors and the api_rate_limit backend we could do:

`sum(rate(haproxy_server_response_errors_total{backend="api_rate_limit"}[1m]) * 60) / sum(rate(haproxy_server_http_responses_total{backend="api_rate_limit"}[1m]) * 60)`
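One detail worth noting: because the same `* 60` scaling appears in both the numerator and the denominator, it cancels out of the ratio. A small Python sanity check with made-up per-second rates (illustrative values only, not taken from production):

```python
# Hypothetical per-second rates, as rate(...[1m]) would return them
# (illustrative values only, not measured from the real backend).
errors_per_sec = 0.1        # ~6 errors/min
responses_per_sec = 26.67   # ~1600 responses/min

# Ratio with the * 60 scaling from the proposed query...
ratio_scaled = (errors_per_sec * 60) / (responses_per_sec * 60)
# ...is identical to the ratio of the raw per-second rates.
ratio_raw = errors_per_sec / responses_per_sec

assert abs(ratio_scaled - ratio_raw) < 1e-12
```

So the query could equally be written without the `* 60` factors; the resulting ratio is the same.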
We will then have to set a threshold for the error rate. The highest number of requests api_rate_limit processed in the last 2 weeks was around 1600 requests per minute (ref: https://prometheus.gprd.gitlab.net/graph?g0.range_input=1w&g0.expr=sum(rate(haproxy_server_http_responses_total%7Bbackend%3D%22api_rate_limit%22%7D%5B1m%5D))%20by%20(backend)&g0.tab=0). The current threshold of `> .1` would mean that 6 errors out of 1600 requests are enough to trigger the alert and page us. That is only 0.375% of requests.
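The arithmetic behind that percentage, as a quick Python check (the peak traffic figure is the one from the Prometheus graph linked above):

```python
# Peak traffic observed for the api_rate_limit backend.
peak_requests_per_min = 1600

# The current static threshold of 0.1 errors/sec corresponds to:
errors_per_min = 0.1 * 60  # 6 errors/min

# As a fraction of peak traffic:
error_ratio = errors_per_min / peak_requests_per_min
print(f"{error_ratio:.5f}  ({error_ratio:.3%} of requests)")  # 0.00375, i.e. 0.375%
```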