Add image scaler duration Prometheus metric
We currently collect a duration_s metric via structured logging. However, it was pointed out in gitlab-com/runbooks!2788 (comment 417291775) that latencies should be tracked in Prometheus as well, so that we can collect long-term trends in Thanos.
We already track HTTP request latencies sliced by the route label. However, this is not accurate enough for several reasons:
- My understanding is that this will observe the entire request duration from first entering the system, going through Rails, and leaving the system again. Since we spend a substantial amount of time in Rails (most of it, in fact), this will distort image scaling durations to an extent that we cannot draw any conclusions about scaler performance anymore.
- The `route` dimension is not fine-grained enough, since it will catch all `/uploads`, but that includes PDFs, videos, etc. Moreover, it will include timings for those cases in which we fail over to serving the original image, which would distort metrics that should really only apply to the scaler.
- The generic HTTP request metrics do not allow us to distinguish between e.g. PNG and JPEG durations, since they are not aware of the nature of the request.
I chose a Histogram rather than a Summary, since https://prometheus.io/docs/practices/histograms/ seems to suggest that tracking summaries client-side can be slow, and since we appear to be using histograms in other places already.
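To illustrate the idea, here is a minimal toy model of a histogram labelled by content type. It is only a sketch and not the code in this merge request, and it does not use any Prometheus client library. The class name `ScalerDurationHistogram`, the `content_type` label and the bucket boundaries are all hypothetical, chosen for illustration. It shows that an observation only increments a few counters, and that a per-content-type label keeps PNG and JPEG durations apart.

```python
import bisect
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; not taken from the merge request.
BUCKETS = (0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf"))


class ScalerDurationHistogram:
    """Toy stand-in for a Prometheus-style histogram labelled by content type."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(buckets)
        # Per label value: one counter per bucket, plus a running sum and count.
        self.bucket_counts = defaultdict(lambda: [0] * len(self.buckets))
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def observe(self, content_type, duration_s):
        # Find the first bucket whose upper bound is >= the observation.
        idx = bisect.bisect_left(self.buckets, duration_s)
        self.bucket_counts[content_type][idx] += 1
        self.sums[content_type] += duration_s
        self.counts[content_type] += 1

    def cumulative(self, content_type):
        # Buckets are reported cumulatively: each bound counts every
        # observation less than or equal to it.
        total, out = 0, {}
        for bound, n in zip(self.buckets, self.bucket_counts[content_type]):
            total += n
            out[bound] = total
        return out


hist = ScalerDurationHistogram()
# Made-up scaler durations, recorded only for requests the scaler handled.
for content_type, duration_s in [
    ("image/png", 0.03),
    ("image/png", 0.2),
    ("image/jpeg", 0.07),
]:
    hist.observe(content_type, duration_s)

print(hist.cumulative("image/png"))
print(hist.cumulative("image/jpeg"))
```

Because each label value has its own set of buckets, PNG and JPEG timings can be queried separately, which the generic `route`-based request metric cannot offer.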
Edited by Matthias Käppler