Add a fast_timeout for the `ServerService.ServerInfo` endpoint

Andrew Newdigate requested to merge gitaly-serverservice-info-timeout into master

What does this MR do?

# Simulate a Gitaly server under extreme load by attaching a debugger, which pauses the process
$ lldb -p $(pgrep gitaly) &

# Without this fix: Prometheus scrape endpoint times out after ~55s with a 500 error
$ time curl http://localhost:3000/-/metrics
NameError at /-/metrics
=======================

> uninitialized constant GRPC::GRPC
...
real	0m58.135s
user	0m0.014s
sys	0m0.016s

# With this fix: the Gitaly call times out after ~10s and the Prometheus scrape endpoint returns the metrics successfully
$ time curl http://localhost:3000/-/metrics
client_browser_timing_count{event="contentComplete"} 40
# HELP client_browser_timing Multiprocess metric
# TYPE client_browser_timing histogram
client_browser_timing_bucket{event="connect",le="+Inf"} 40
client_browser_timing_bucket{event="connect",le="0.005"} 40
...

real	0m13.607s
user	0m0.020s
sys	0m0.027s

Why was this MR needed?

When any single Gitaly server fails, Prometheus healthcheck requests can saturate up to 50% of the web and API fleet's capacity. These requests arrive 4 times a minute on each node, and each one currently takes almost a full minute before failing with a 500 error.
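
The load described above can be estimated with Little's law (average concurrency = arrival rate × time per request). The sketch below is a back-of-the-envelope calculation, not a measurement: the request rate and durations come from this description, while the `busy_workers` helper is hypothetical and assumes a steady state in which each scrape holds one worker for its whole duration.

```ruby
# Average number of workers on one node that are tied up by scrape requests.
# Little's law: concurrency = arrival rate (per second) * seconds per request.
# `busy_workers` is a hypothetical helper, not part of GitLab.
def busy_workers(requests_per_minute, seconds_per_request)
  (requests_per_minute / 60.0) * seconds_per_request
end

before = busy_workers(4, 58.135) # wall-clock time of the run without the fix
after  = busy_workers(4, 13.607) # wall-clock time of the run with the fix

puts format("without fix: %.2f workers busy per node", before)
puts format("with fix:    %.2f workers busy per node", after)
```

Under these assumptions each node has close to four workers permanently blocked before the fix and fewer than one after it.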

With this fix, when any single Gitaly server fails, the Prometheus endpoint returns after about 10 seconds with the correct metrics and a 200 response. https://gitlab.com/gitlab-org/gitlab-ce/issues/49112 will further correct this behaviour by detaching healthchecking from Prometheus scraping.
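
The general idea of bounding a slow call with a deadline, so that the caller can still respond, can be sketched in plain Ruby. This is an illustration only and not the code in this MR: it uses the standard library's `Timeout` module as a stand-in for a gRPC deadline, and `call_with_deadline`, `FAST_TIMEOUT_SECONDS`, and the `gitaly_reachable` output line are hypothetical names chosen for the example.

```ruby
require "timeout"

# Assumed value, taken from the ~10s figure in this description.
FAST_TIMEOUT_SECONDS = 10

# Hypothetical helper: run the block, but give up after `deadline_seconds`.
# Returns the block's value on success and nil if the deadline is exceeded
# or the block raises, so the caller can report a failure instead of
# hanging or crashing.
def call_with_deadline(deadline_seconds = FAST_TIMEOUT_SECONDS)
  Timeout.timeout(deadline_seconds) { yield }
rescue StandardError # Timeout::Error is a StandardError subclass
  nil
end

# Simulate an unresponsive server with a sleep that outlasts a short deadline.
result = call_with_deadline(0.1) { sleep 1; :server_info }
puts "gitaly_reachable #{result.nil? ? 0 : 1}"
```

A real gRPC client would pass a deadline on the call itself rather than wrap it in `Timeout`, which interrupts the calling thread; the sketch only shows why a short bound lets the metrics endpoint answer even when one backend is stalled.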

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Edited by Rémy Coutable
