Reduce number of buckets for Prometheus histograms created by web transactions
We have some metrics with very high (> 100,000) cardinality (that is, the number of unique label sets for observations in that metric).
All of those over 200,000 are:
- Used by web requests.
- Histograms.
That points to the sources of the cardinality. For instance, web requests have `controller` and `action` labels to uniquely identify what is processing a request, and those give almost 2,000 unique combinations: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=count(count%20by%20(controller%2C%20action)%20(gitlab_sql_duration_seconds_bucket))&g0.tab=1
The histograms add one label value per bucket (histograms work by adding an `le` label to the metric). As the Redis histogram has more buckets than the SQL one (compare https://gitlab.com/gitlab-org/gitlab/-/blob/v13.1.0-ee/lib/gitlab/instrumentation/redis.rb#L15 and https://gitlab.com/gitlab-org/gitlab/-/blob/v13.1.0-ee/lib/gitlab/metrics/subscribers/active_record.rb#L31, for instance), it has higher cardinality.
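To illustrate how the `le` label multiplies series, here is a minimal Python sketch. The function name, the label values, and the bucket boundaries are hypothetical and are not taken from the GitLab code; the only Prometheus behaviour assumed is that every histogram exposes one `_bucket` series per boundary plus a final `+Inf` bucket.

```python
def bucket_series(metric, labels, boundaries):
    """Return the `_bucket` series that one label set expands into.

    `boundaries` are the finite bucket upper bounds; Prometheus always
    adds a final `+Inf` bucket on top of them.
    """
    le_values = [format(b, "g") for b in sorted(boundaries)] + ["+Inf"]
    series = []
    for le in le_values:
        pairs = dict(labels, le=le)
        rendered = ",".join(f'{k}="{v}"' for k, v in sorted(pairs.items()))
        series.append(f"{metric}_bucket{{{rendered}}}")
    return series


# Hypothetical label set and boundaries, for illustration only.
example = bucket_series(
    "gitlab_sql_duration_seconds",
    {"controller": "ExampleController", "action": "show"},
    [0.1, 0.5, 2.0],
)
for line in example:
    print(line)
```

Every extra boundary adds one more series for each existing label set, which is why a histogram with more buckets ends up with higher cardinality.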
The other contributing factors to the cardinality here aren't controlled by the application, but we can see that they add up to around 50 unique values: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=0s&g0.expr=count(count%20by%20(fqdn%2C%20instance%2C%20job%2C%20tier%2C%20sv%2C%20region%2C%20env%2C%20environment%2C%20job%2C%20type)%20(gitlab_sql_duration_seconds_bucket%7Btype%3D~%22api%7Cweb%22%7D))&g0.tab=1
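2,000 (controller, action) * 12 (histogram buckets) * 50 (remaining labels) = 1,200,000. This is clearly an overestimate, presumably because not every (controller, action) pair is observed with every combination of the remaining labels, but it gives a rough idea of the scale.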
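The same estimate can be written as a small Python sketch. The function name is made up for illustration, and the inputs are the rounded figures quoted above rather than exact counts.

```python
def estimate_series(controller_action_pairs, buckets, other_label_sets):
    """Upper bound on series count, assuming every combination occurs."""
    return controller_action_pairs * buckets * other_label_sets


upper_bound = estimate_series(2_000, 12, 50)
per_bucket = estimate_series(2_000, 1, 50)

print(f"upper bound: {upper_bound:,}")      # upper bound: 1,200,000
print(f"cost of one bucket: {per_bucket:,}")  # cost of one bucket: 100,000
```

Under this rough model, each bucket accounts for up to 100,000 series, so the bucket count is the cheapest factor to change from the application side.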
From the application side, the easiest way to reduce the cardinality is to reduce the number of histogram buckets. @andrewn suggested (https://gitlab.slack.com/archives/C014BCMAAVB/p1593436851247500?thread_ts=1593417529.233100&cid=C014BCMAAVB) that we simply change these buckets to only those needed for apdex scores.
- We don't need to do this for Sidekiq metrics right now.
- We would also need to remove any charts using these histograms from the dashboards, as they'd become very unhelpful (we can still use Kibana for exact timings - which is always more accurate anyway).
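As a rough sketch of why a couple of buckets is enough for apdex, the Python below computes an apdex score from cumulative bucket counts using only a "satisfied" and a "tolerated" threshold. The threshold values (0.5s and 2.0s) and the counts are hypothetical, not the values GitLab uses, and the function is an illustration of the standard apdex formula rather than the implementation proposed here.

```python
def apdex(cumulative_counts, satisfied, tolerated):
    """Apdex from cumulative histogram counts.

    `cumulative_counts` maps a bucket upper bound (`le`) to the number of
    observations at or below it, and must include float("inf") for the
    total. Returns None when there are no observations.
    """
    total = cumulative_counts[float("inf")]
    if total == 0:
        return None
    satisfied_count = cumulative_counts[satisfied]
    tolerating_count = cumulative_counts[tolerated] - satisfied_count
    return (satisfied_count + tolerating_count / 2) / total


# Hypothetical thresholds and counts, for illustration only.
counts = {0.5: 80, 2.0: 95, float("inf"): 100}
score = apdex(counts, satisfied=0.5, tolerated=2.0)
print(score)  # 0.875
```

Only the two thresholds and the implicit `+Inf` bucket are read, so any other boundaries add series without changing the score. The trade-off is the one noted above: with so few buckets, latency distribution charts built on these histograms stop being useful.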