How to track latencies in Prometheus
We have some nice Prometheus request and byte counters in gitlab-workhorse now, but I am not sure how to approach latency metrics.
From reading https://prometheus.io/docs/practices/histograms/ it sounds like we might want histograms? Here is what I understand.
- suitable data types are 'histogram' and 'summary'
- histogram has fixed bucket boundaries
- summary reports a fixed set of quantiles, computed separately on each instance; not suitable for aggregation because quantiles from different instances cannot be meaningfully combined
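
To make the aggregation point concrete, here is a minimal sketch of why histograms combine across instances: bucket counts that share the same boundaries can simply be added, and a quantile can then be estimated from the merged counts. The bucket boundaries, the counts and the function names are made up for illustration, and the interpolation is only an approximation of what a query-time quantile estimate does, not Prometheus's actual implementation.

```python
import math

# Hypothetical bucket upper bounds in seconds, identical on every instance.
BOUNDS = [0.1, 1.0, 10.0, math.inf]


def merge(histograms):
    """Add cumulative bucket counts from several instances, bucket by bucket."""
    return [sum(counts) for counts in zip(*histograms)]


def estimate_quantile(q, bounds, cumulative):
    """Estimate quantile q by linear interpolation inside the matching bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if math.isinf(bounds[i]):
                # The quantile lies beyond the last finite boundary, so the
                # best we can report is that boundary itself.
                return bounds[i - 1]
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            return lower + (bounds[i] - lower) * (rank - prev) / (count - prev)


# Made-up cumulative counts from two boxes (100 requests each).
box_a = [50, 90, 99, 100]
box_b = [10, 60, 95, 100]

merged = merge([box_a, box_b])
median = estimate_quantile(0.5, BOUNDS, merged)  # roughly 0.5 seconds
```

The estimate is only as precise as the bucket it lands in, which is why the choice of boundaries matters.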
On gitlab.com we run workhorse on about 20 boxes so we want to aggregate our numbers. That points towards histograms. But then how do we pick our bucket boundaries?
We don't really know what numbers to expect. Some requests will take less than 60 seconds (the unicorn timeout), while others may take much longer (git clone of a large repo). With a range that wide and no prior data, what should the bucket boundaries be?
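
One possible starting point when the expected range is unknown is exponentially spaced boundaries, which cover several orders of magnitude with a small number of buckets. The sketch below is only an illustration: the helper name and the start, factor and count values are assumptions, not measured or recommended numbers. The Prometheus Go client offers a helper with the same idea, `prometheus.ExponentialBuckets(start, factor, count)`.

```python
def exponential_buckets(start, factor, count):
    """Return `count` upper bounds: start, start*factor, start*factor**2, ..."""
    if start <= 0 or factor <= 1 or count < 1:
        raise ValueError("need start > 0, factor > 1 and count >= 1")
    return [start * factor ** i for i in range(count)]


# Assumed values: 9 buckets from 100 ms up to well over an hour.
buckets = exponential_buckets(start=0.1, factor=4, count=9)
```

With these assumed values the boundaries run from 0.1 seconds to roughly 6500 seconds, so both ordinary requests under the 60 second timeout and long-running clones fall into a finite bucket. The boundaries could be tightened later once real data shows where requests actually cluster.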
cc @pcarranza