http_request_duration_seconds_count metrics should be initialized to zero at startup

The http_request_duration_seconds_count metric has a status label.

Each time a request returns an HTTP 500 status, http_request_duration_seconds_count{status="500"} is incremented, and likewise for every other status code.

After process startup, the series for a given status code does not exist until the first request with that status code is recorded.

This is fine, except that we can't distinguish a metric that is missing because of a configuration error from one that is missing because there have not been any errors yet. This leads to us generating alerts about missing metrics, such as:

image

Another example:

https://prometheus-app.gprd.gitlab.net/graph?g0.range_input=6h&g0.expr=%20%20%20%20%20%20sum%20by%20(environment%2C%20tier%2C%20type%2C%20stage)%20(rate(http_request_duration_seconds_count%7Bjob%3D%22gitlab-unicorn%22%2C%20status%3D~%22%5E5.*%22%2Cenvironment%3D%22gprd%22%2Cstage%3D%22cny%22%2Ctier%3D%22sv%22%2Ctype%3D%22api%22%7D%5B1m%5D))&g0.tab=0

Instead of lazily initializing these counters, we should eagerly initialize them to zero at startup time for a set of common HTTP status codes (200, 301, 304, 400, 401, 403, 404, 500 at a minimum).
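
To illustrate the difference, here is a minimal sketch in Python. It does not use a real Prometheus client: LabeledCounter, init_labels, expose, and build_counter are hypothetical names made up for this example, and the actual change would need to use whatever the application's metrics library provides for pre-creating label sets. It only shows that a lazily initialized counter exposes no status="500" series until the first 500 occurs, while an eagerly initialized one exposes it at 0 from startup.

```python
# Hypothetical stand-in for a labeled Prometheus counter (not a real client API).
class LabeledCounter:
    def __init__(self, name):
        self.name = name
        self._series = {}  # status label value -> current count

    def inc(self, status, amount=1):
        # Lazy behaviour: the series only comes into existence on first increment.
        self._series[status] = self._series.get(status, 0) + amount

    def init_labels(self, statuses):
        # Eager behaviour: create each series at zero without counting a request.
        for status in statuses:
            self._series.setdefault(status, 0)

    def expose(self):
        # Render the existing series roughly in the text exposition style.
        return [
            f'{self.name}{{status="{status}"}} {value}'
            for status, value in sorted(self._series.items())
        ]


# The minimum set of status codes proposed in this issue.
COMMON_STATUSES = ["200", "301", "304", "400", "401", "403", "404", "500"]


def build_counter(eager):
    counter = LabeledCounter("http_request_duration_seconds_count")
    if eager:
        counter.init_labels(COMMON_STATUSES)
    return counter


lazy = build_counter(eager=False)
eager = build_counter(eager=True)

# Simulate one successful request on each process; no errors have happened yet.
lazy.inc("200")
eager.inc("200")

print("\n".join(lazy.expose()))   # only the status="200" series exists
print("\n".join(eager.expose()))  # all eight series exist, errors at 0
```

With the eager variant, a rate() over status=~"^5.*" would return 0 rather than no data, so an absent series can be treated as a likely configuration problem instead of "no errors yet".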

cc @bjk
