
Reduce most of GitLab's histograms to 3-5 buckets


This is a fairly radical proposal, but it is primarily driven by the cardinality explosion that we currently see in our Prometheus stack.

At present, each gitlab-rails process exports over 70k metric series, amounting to more than 8MB of text, and we scrape this four times every minute. This results in a huge overhead, both in the Rails process itself and in the Prometheus and Thanos infrastructure we run behind our monitoring stack.

At present, the monitoring stack is one of our least reliable services. Given its importance to the infrastructure teams that rely on it, and to the availability of GitLab.com, the goal should be for it to be one of the most reliable. We have an availability target of 99.9%, which is fairly low, and we still fail to meet it for extended periods.

Of the 10 highest cardinality metrics, 9 are histograms. This is not surprising, but the cardinality of these metrics is very high: the worst has 400k different label combinations 😱

With this in mind, and remembering that histograms are inherently inaccurate, particularly so when using generic default buckets, I would like to propose the approach outlined below. First, the current worst offenders:

$ curl -vi web-06-sv-gprd.c.gitlab-production.internal:8083/metrics|cut -d'{' -f1|sort|uniq -c|sort -nr|head
  10980 http_redis_requests_duration_seconds_bucket
   8235 http_elasticsearch_requests_duration_seconds_bucket
   8028 gitlab_sql_duration_seconds_bucket
   6405 gitlab_transaction_duration_seconds_bucket
   6405 gitlab_transaction_cputime_seconds_bucket
   6405 gitlab_transaction_allocated_memory_bytes_bucket
   1771 gitlab_cache_operations_total
   1056 http_request_duration_seconds_bucket
    915 http_redis_requests_total
    915 http_redis_requests_duration_seconds_sum

http_redis_requests_duration_seconds_bucket accounts for 10980 series on its own 😱 😱; see #460 (closed) for more discussion on this.

Limit most histograms to threshold values

Histograms are mostly used for monitoring latencies on GitLab.com.

Over the past few years, we've generally been moving away from monitoring latencies using percentile estimations, and towards using apdex scores instead.

The minimum number of buckets needed to calculate an apdex score is two: the satisfactory threshold bucket and the le="+Inf" bucket.

However, many apdex scores are calculated from three buckets: satisfactory, tolerable and le="+Inf".
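For illustration, the classic three-bucket apdex calculation looks something like the query below. The 1s and 5s thresholds are hypothetical; the le values have to match whatever bucket boundaries are actually configured, which is exactly the brittleness discussed further down.

# Apdex = (satisfied + tolerated) / 2 / total. Because buckets are
# cumulative, the le="5" bucket already includes the le="1" requests.
(
    sum(rate(http_request_duration_seconds_bucket{le="1"}[5m]))
  +
    sum(rate(http_request_duration_seconds_bucket{le="5"}[5m]))
)
/ 2
/
sum(rate(http_request_duration_seconds_count[5m]))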

There is an argument that, since the remaining buckets are only very occasionally used for percentile estimation graphs (they are generally no longer used in monitoring and alerting), are inherently inaccurate, and are driving the cardinality explosion we're seeing, we should simply drop them.

But what if I want to compare latency differences (eg between feature flag on/off, or canary vs the main stage)?

In this case, it's unlikely that you want to use histograms anyway. Most performance differences will fall within the same bucket, so estimating the quantile will give you incorrect results.
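For example, the kind of comparison one might reach for looks something like the queries below (the stage label and its values are purely illustrative); both estimates are interpolated within the same bucket boundaries, so small regressions simply don't show up.

# Estimated p95 for canary vs main; each value is a linear
# interpolation within whichever bucket the 95th percentile lands in.
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{stage="cny"}[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{stage="main"}[5m])))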

Instead, for this purpose, use logs (ie, Kibana/ES) and get a non-estimated, accurate percentile comparison between the old and new code.

What else?

Anecdotally, in a conversation at PromCon, somebody (I can't remember who) mentioned that the company they work at (a large Prometheus user, iirc) has settled on using mostly threshold buckets in their histograms, due to cardinality issues.

Interesting options

At present, apdex thresholds for GitLab.com are fairly brittle. The thresholds, specified in the runbooks repo and controlled by SREs, must align with bucket boundaries, specified in the GitLab repository and controlled by engineers. When one changes, so must the other, and this is a manual process.

If we switch to two or three bucket histograms, we can address this brittleness.
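As a sketch of what this could look like on the /metrics endpoint (the label, thresholds and sample values below are hypothetical), a histogram reduced to two threshold buckets plus le="+Inf" exposes only a handful of series per label combination:

# Lower boundary = satisfactory threshold, upper boundary = tolerated threshold.
http_request_duration_seconds_bucket{method="get",le="1"} 1340
http_request_duration_seconds_bucket{method="get",le="5"} 1392
http_request_duration_seconds_bucket{method="get",le="+Inf"} 1401
http_request_duration_seconds_sum{method="get"} 512.3
http_request_duration_seconds_count{method="get"} 1401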

If we assume that the lower boundary, specified by the engineer, is always the satisfactory threshold, and the higher boundary is always the tolerated threshold, the following formula can be used to calculate the apdex:

# Averaging the satisfactory and tolerated bucket ratios is equivalent to
# (satisfied + tolerated) / 2 / total, since the buckets are cumulative.
avg without(le) (
  rate(http_request_duration_seconds_bucket{le!="+Inf"}[5m])
  / ignoring(le) group_left
  rate(http_request_duration_seconds_bucket{le="+Inf"}[5m])
)

What's really nice about this approach is that the engineer gets to control the thresholds, and the single source of truth (SSOT) is the GitLab codebase.
