Add image scaler duration Prometheus metric

We currently collect a duration_s metric via structured logging. However, it was pointed out in gitlab-com/runbooks!2788 (comment 417291775) that latencies should be tracked in Prometheus as well, so that we can collect long-term trends in Thanos.

We already track HTTP request latencies sliced by the route label. However, this is not accurate enough for several reasons:

  1. My understanding is that this observes the entire request duration, from first entering the system, going through Rails, and leaving the system again. Since we spend a substantial amount of that time in Rails (most of it, in fact), this distorts image scaling durations to an extent that we can no longer draw any conclusions about scaler performance.
  2. The route dimension is not fine-grained enough, since it matches all of /uploads, which includes PDFs, videos, etc. Moreover, it includes timings for those cases in which we fall back to serving the original image, which would distort metrics that should really only apply to the scaler.
  3. The generic HTTP request metrics do not allow us to distinguish between, for example, PNG and JPEG durations, since they are not aware of the nature of the request.

I chose a Histogram rather than a Summary, since https://prometheus.io/docs/practices/histograms/ seems to suggest that tracking summaries client-side can be slow, and since we already appear to be using histograms in other places.

Edited by Matthias Käppler
