Remove healthchecks from prometheus endpoint

Andrew Newdigate requested to merge an/no-healthcheck-until-brooklyn into master

What does this MR do?

Are there points in the code the reviewer needs to double check?

# Simulate a Gitaly server under extreme load by attaching a debugger, which suspends the process
$ lldb -p $(pgrep gitaly) &

# Without this fix: the Prometheus scrape endpoint times out after ~55s with a 500 error
$ time curl http://localhost:3000/-/metrics
NameError at /-/metrics
=======================

> uninitialized constant GRPC::GRPC
...
real	0m58.135s
user	0m0.014s
sys	0m0.016s

# With this fix: the Prometheus scrape endpoint returns the metrics successfully and immediately
$ time curl http://localhost:3000/-/metrics
client_browser_timing_count{event="contentComplete"} 40
# HELP client_browser_timing Multiprocess metric
# TYPE client_browser_timing histogram
client_browser_timing_bucket{event="connect",le="+Inf"} 40
client_browser_timing_bucket{event="connect",le="0.005"} 40
...

real	0m0.607s
user	0m0.020s
sys	0m0.027s

Why was this MR needed?

For reasons similar to those in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/20552:

When any single Gitaly server fails, up to 50% of the web and API fleet's capacity is taken up by Prometheus healthcheck requests. These requests arrive 4 times a minute on each node, and each one currently takes almost a full minute before failing with a 500 error.

With this fix, when any single Gitaly server fails, the Prometheus endpoint returns immediately with the correct metrics and a 200 response.
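The load described above can be estimated with a rough back-of-the-envelope calculation: the number of scrape requests in flight on a node is approximately the arrival rate multiplied by the time each request takes. The sketch below uses the figures from this MR (4 scrapes per minute, about 58 seconds per failing scrape). The function names `in_flight_requests` and `capacity_share`, and the worker count of 8 per node, are assumptions made for illustration only; the MR does not state how many workers each node runs.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of GitLab or this MR.

# Estimate how many scrape requests are in flight at once on a node:
#   in_flight = (requests per minute / 60) * seconds per request
in_flight_requests() {
  awk -v r="$1" -v d="$2" 'BEGIN { printf "%.2f\n", r / 60 * d }'
}

# Express in-flight requests as a percentage of the node's workers.
# The worker count is an assumption supplied by the caller.
capacity_share() {
  awk -v n="$1" -v w="$2" 'BEGIN { printf "%.0f\n", n / w * 100 }'
}

# Figures from the MR: 4 scrapes per minute, ~58s per failing scrape.
blocked=$(in_flight_requests 4 58)
echo "in-flight scrape requests per node: $blocked"

# ASSUMPTION: 8 workers per node (not stated in the MR).
echo "share of workers occupied: $(capacity_share "$blocked" 8)%"
```

With a scrape that returns in well under a second, as in the "with this fix" timing, the same calculation gives a small fraction of one worker, for example `in_flight_requests 4 0.6`.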

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Edited by Andrew Newdigate
