Remove healthchecks from prometheus endpoint (!20565) · Merge requests · GitLab.org / GitLab FOSS

Andrew Newdigate requested to merge an/no-healthcheck-until-brooklyn into master Jul 11, 2018

What does this MR do?

Fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/49112
Metrics will be fixed in https://gitlab.com/gitlab-org/gitlab-ce/issues/49178

Are there points in the code the reviewer needs to double check?

# Simulate a Gitaly server under extreme load by attaching a debugger, which suspends the process
$ lldb -p $(pgrep gitaly) &

# Without this fix: Prometheus scrape endpoint times out after ~55s with a 500 error
$ time curl http://localhost:3000/-/metrics
NameError at /-/metrics
=======================

> uninitialized constant GRPC::GRPC
...
real	0m58.135s
user	0m0.014s
sys	0m0.016s

# With this fix: Prometheus scrape endpoint returns the Prometheus metrics successfully and almost immediately
$ time curl http://localhost:3000/-/metrics
client_browser_timing_count{event="contentComplete"} 40
# HELP client_browser_timing Multiprocess metric
# TYPE client_browser_timing histogram
client_browser_timing_bucket{event="connect",le="+Inf"} 40
client_browser_timing_bucket{event="connect",le="0.005"} 40
...

real	0m0.607s
user	0m0.020s
sys	0m0.027s
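
The manual `time curl` check above could be wrapped in a small pass/fail helper. This is only a sketch: the function name `check_scrape` and the 10-second limit in the usage line are assumptions for illustration and are not part of this MR. It relies on standard curl options (`--fail`, `--max-time`, `--write-out`).

```shell
# Hypothetical helper: succeed only if the metrics endpoint answers with a
# non-error HTTP status within the given number of seconds.
check_scrape() {
  url="$1"     # e.g. http://localhost:3000/-/metrics
  limit="$2"   # maximum seconds to wait before giving up

  # --fail makes curl exit non-zero on HTTP 4xx/5xx responses;
  # --max-time makes it exit non-zero when the time limit is exceeded.
  # The status code and total time are printed for the log.
  curl --silent --fail --max-time "$limit" \
       --output /dev/null \
       --write-out '%{http_code} %{time_total}\n' \
       "$url"
}
```

For example, `check_scrape http://localhost:3000/-/metrics 10` should fail in the "without this fix" scenario, where the request takes close to a minute, and succeed once the endpoint returns promptly.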

Why was this MR needed?

When any single Gitaly server fails, Prometheus healthcheck requests can saturate up to 50% of the web and API fleet's capacity. Each node is scraped 4 times a minute, and each request currently takes almost a full minute before failing with a 500 error.
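
As a rough back-of-the-envelope check of the saturation described above, the average number of workers tied up by scrapes on one node is the request rate multiplied by the request duration. The helper name `busy_workers` is hypothetical, and the durations are the `real` times from the curl runs earlier in this description. The number of workers per node is not stated here, so the result is left as an absolute count rather than a percentage.

```shell
# Hypothetical helper: average number of workers occupied by scrape requests.
#   $1 = scrapes per minute on one node
#   $2 = seconds each scrape takes
busy_workers() {
  awk -v rate="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", rate * secs / 60 }'
}

busy_workers 4 58.135   # without the fix: prints 3.88
busy_workers 4 0.607    # with the fix:    prints 0.04
```

In other words, without the fix roughly four workers per node are continuously blocked on scrapes, whereas with the fix the cost is negligible.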

With this fix, when any single Gitaly server fails, the Prometheus endpoint returns immediately with the correct metrics and a 200 response.

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Edited Jul 12, 2018 by Andrew Newdigate
