Add a fast_timeout for the `ServerService.ServerInfo` endpoint (!20552) · Merge requests · GitLab.org / GitLab FOSS

Andrew Newdigate requested to merge gitaly-serverservice-info-timeout into master Jul 11, 2018

What does this MR do?

Adds a fast timeout for the ServerService.ServerInfo endpoint
Fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/49116
Reduces the impact of https://gitlab.com/gitlab-org/gitlab-ce/issues/49112

# Simulate a Gitaly server under extreme load....
$ lldb -p $(pgrep gitaly) & 

# Without this fix: Prometheus scrape endpoint times out after ~55s with a 500 error
$ time curl http://localhost:3000/-/metrics
NameError at /-/metrics
=======================

> uninitialized constant GRPC::GRPC
...
real	0m58.135s
user	0m0.014s
sys	0m0.016s

# With this fix: the Gitaly call times out after ~10s and the Prometheus scrape endpoint still returns the metrics successfully
$ time curl http://localhost:3000/-/metrics
client_browser_timing_count{event="contentComplete"} 40
# HELP client_browser_timing Multiprocess metric
# TYPE client_browser_timing histogram
client_browser_timing_bucket{event="connect",le="+Inf"} 40
client_browser_timing_bucket{event="connect",le="0.005"} 40
...

real	0m13.607s
user	0m0.020s
sys	0m0.027s
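
The behaviour demonstrated above can be illustrated with a minimal Ruby sketch. It is not the code from this MR: `server_info`, `check_server`, `FAST_TIMEOUT` and `HANG_TIME` are hypothetical names, the durations are scaled down so the sketch runs quickly, and the standard library's `Timeout` stands in for a per-call gRPC deadline. The idea is only that a short bound on the call lets the caller report a failed check instead of blocking for close to a minute.

```ruby
require 'timeout'

# Hypothetical values, scaled down so the sketch finishes quickly.
FAST_TIMEOUT = 0.2 # seconds; stands in for the ~10s fast timeout
HANG_TIME = 2.0    # seconds; stands in for a Gitaly server that never answers

# Stand-in for the ServerService.ServerInfo call against an unresponsive server.
def server_info
  sleep HANG_TIME
  { version: 'unknown' }
end

# Bound the call with a short timeout and report the failure as data
# instead of raising, so the caller can still build its response.
def check_server(timeout: FAST_TIMEOUT)
  Timeout.timeout(timeout) { server_info }
  { success: true }
rescue Timeout::Error
  { success: false, message: "no response within #{timeout}s" }
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = check_server
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts result.inspect
puts format('returned after %.1fs', elapsed)
```

In this sketch the check returns shortly after the timeout elapses rather than after the full hang, which mirrors the drop from roughly 58s to roughly 13s in the timings above.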

Why was this MR needed?

When any single Gitaly server fails, Prometheus healthcheck requests saturate up to 50% of the web and API fleet's capacity. These requests arrive 4 times a minute on each node, and each one currently takes almost a full minute before failing with a 500 error.

With this fix, when any single Gitaly server fails, the Prometheus endpoint returns after about 10 seconds with the correct metrics and a 200 response. https://gitlab.com/gitlab-org/gitlab-ce/issues/49112 will improve this further by detaching healthchecking from Prometheus scraping.
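
As a rough check on the figures quoted above, the average number of scrape requests in flight on a node is the request rate multiplied by the time each request takes (Little's law). The sketch below uses the 4-per-minute rate and the two `real` timings from the description. It assumes scrapes are evenly spaced and that each one occupies a single worker for its whole duration. The helper name `busy_workers` is hypothetical, and the number of workers per node is not stated here, so the result is a count of occupied workers, not a percentage.

```ruby
# Scrape rate from the description: 4 requests per minute on each node.
SCRAPES_PER_MINUTE = 4.0

# Little's law: average in-flight requests = arrival rate * time per request.
def busy_workers(request_seconds)
  (SCRAPES_PER_MINUTE / 60.0) * request_seconds
end

without_fix = busy_workers(58.135) # "real" time measured without the fix
with_fix = busy_workers(13.607)    # "real" time measured with the fix

puts format('without fix: %.2f workers busy per node', without_fix)
puts format('with fix:    %.2f workers busy per node', with_fix)
```

Under these assumptions, close to four workers per node are tied up by scrapes without the fix, and fewer than one with it.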

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Fixes https://gitlab.com/gitlab-org/gitlab-ce/issues/49116
Reduces the impact of https://gitlab.com/gitlab-org/gitlab-ce/issues/49112

Edited Jul 12, 2018 by Rémy Coutable
