Metrics: Migrate the `status` label from `http_request_duration_seconds_bucket` to `http_requests_total`
Currently, we use a `status` label on the `http_request_duration_seconds_bucket` histogram metric to monitor the number of requests that complete with a given status code.
We should move this label from the `http_request_duration_seconds_bucket` histogram to the `http_requests_total` counter metric.
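To illustrate why the location of the label matters, here is a minimal Python sketch (not part of the original proposal). It assumes worst-case cardinality, where every combination of label values produces a series. The endpoint, status, and bucket values are hypothetical and chosen only to keep the numbers small.

```python
from itertools import product

# Hypothetical label values, for illustration only.
ENDPOINTS = ["/projects", "/issues", "/merge_requests"]
STATUSES = ["200", "404", "500"]
BUCKETS = ["0.1", "0.5", "1", "+Inf"]  # histogram `le` boundaries


def series_count(*label_dimensions):
    """Worst-case series count: one series per combination of label values."""
    return len(list(product(*label_dimensions)))


def layout_current():
    """`status` lives on the histogram, so it multiplies every bucket."""
    return {
        "http_requests_total": series_count(ENDPOINTS),
        "http_request_duration_seconds_bucket": series_count(ENDPOINTS, STATUSES, BUCKETS),
        "http_request_duration_seconds_sum": series_count(ENDPOINTS, STATUSES),
        "http_request_duration_seconds_count": series_count(ENDPOINTS, STATUSES),
    }


def layout_proposed():
    """`status` lives on the counter, so it multiplies a single series per endpoint."""
    return {
        "http_requests_total": series_count(ENDPOINTS, STATUSES),
        "http_request_duration_seconds_bucket": series_count(ENDPOINTS, BUCKETS),
        "http_request_duration_seconds_sum": series_count(ENDPOINTS),
        "http_request_duration_seconds_count": series_count(ENDPOINTS),
    }


if __name__ == "__main__":
    for name, layout in (("current", layout_current()), ("proposed", layout_proposed())):
        print(name, sum(layout.values()), layout)
```

In this toy model the histogram pays for `status` once per bucket plus once each for `_sum` and `_count`, while the counter pays for it only once. That is the effect the series counts further down measure on real data.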
Advantages
- Greatly reduced cardinality. We will generate about a quarter of the series we currently generate, which reduces load across our monitoring stack (Thanos, Prometheus, etc.).
- We can exclude health checks from our SLIs. At present, health check counters are stored on `http_health_requests_total` as opposed to `http_requests_total`, but this is not usable since the `status` label is on `http_request_duration_seconds_bucket`. More analysis from @cmiskell here: #267 (comment 404373425)
- Fewer alerts during deployments
Disadvantages
- This may affect customers who use their own monitoring and alerting rules.
- We will no longer be able to break request latencies down by status code. This is a fairly standard compromise.
Metric Load Reduction
Some quick arithmetic showing how moving this label from the duration histogram to the request counter greatly reduces our Prometheus metric load, with very little loss of observability.
Currently
- `count(http_requests_total{env="gprd", job="gitlab-rails"})` -> 510 series
- `count(count(http_request_duration_seconds_bucket{env="gprd", job="gitlab-rails"}) by (status))` -> 25 status codes
- `count(http_request_duration_seconds_bucket{env="gprd", job="gitlab-rails"})` -> 79431 series
- `count(http_request_duration_seconds_sum{env="gprd", job="gitlab-rails"})` -> 7221 series
- `count(http_request_duration_seconds_count{env="gprd", job="gitlab-rails"})` -> 7221 series
Total series before: 510 + 79431 + 7221 + 7221 = 94383
Moving `status` to `http_requests_total`
- `count(http_requests_total{env="gprd", job="gitlab-rails"})` -> 510 series x 25 status codes = 12750 series (worst case)
- `count(http_request_duration_seconds_bucket{env="gprd", job="gitlab-rails"})` -> 7865 series
- `count(http_request_duration_seconds_sum{env="gprd", job="gitlab-rails"})` -> 715 series
- `count(http_request_duration_seconds_count{env="gprd", job="gitlab-rails"})` -> 715 series
Total series after: 12750 + 7865 + 715 + 715 = 22045, about 23% of the original series count (worst case!)
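The totals above can be re-checked with a short script. This is only a sketch that re-adds the figures quoted in this issue. The dictionary layout and helper names are my own, and the worst case assumes every one of the 510 `http_requests_total` series sees all 25 status codes.

```python
# Series counts as reported by the queries in this issue (env="gprd", job="gitlab-rails").
STATUS_CODES = 25

BEFORE = {
    "http_requests_total": 510,
    "http_request_duration_seconds_bucket": 79431,
    "http_request_duration_seconds_sum": 7221,
    "http_request_duration_seconds_count": 7221,
}

AFTER = {
    # Worst case: every existing counter series fans out to all status codes.
    "http_requests_total": 510 * STATUS_CODES,
    "http_request_duration_seconds_bucket": 7865,
    "http_request_duration_seconds_sum": 715,
    "http_request_duration_seconds_count": 715,
}


def total(series_by_metric):
    """Sum the per-metric series counts."""
    return sum(series_by_metric.values())


def remaining_fraction(before, after):
    """Fraction of the original series count that remains after the move."""
    return total(after) / total(before)


if __name__ == "__main__":
    print("before:", total(BEFORE))
    print("after: ", total(AFTER))
    print("remaining: {:.0%}".format(remaining_fraction(BEFORE, AFTER)))
```

Because the counter figure is an upper bound, the real saving should be at least this large.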
Edited by Andrew Newdigate