Aggressive Gitaly Timeouts
Introduction
In #429 (closed) and #650 we're addressing noisy clients by "putting them to the back of the line".
While this protects Gitaly from traffic surges, it could have a negative effect on upstream unicorn services.
The GitLab Ruby monolith runs inside a unicorn container, with each request being handled by its own process. Once the request is complete, the process is free to handle the next request. This architecture tends to make unicorn less "elastic" than event-driven architectures (and, to a lesser degree, multithreaded servers), since the cost of waiting for downstream services is an entire unicorn worker process, and the number of these in the system is finite.
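To make the capacity constraint concrete, here is a minimal, hypothetical unicorn.rb sketch (the values are illustrative, not GitLab.com's actual configuration):

```ruby
# unicorn.rb (illustrative values): the worker pool is the entire request
# capacity of the box, and a worker stuck waiting on a downstream service
# holds one of these slots until it finishes or is killed.
worker_processes 16            # at most 16 requests in flight at once
listen '/var/run/unicorn.sock'
timeout 60                     # a blocked worker is only reaped after 60s
```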
One analogy that works is to picture GitLab.com traffic as a motorway, with unicorn as a narrow section (imagine a bridge) and rate limiting as a narrow ("rate-limited") offramp on the other side of the bridge.
Now imagine that there was an event at the end of the offramp and thousands of people from the city were all trying to drive down it.
The traffic begins to queue for the offramp. This is fine as it's not affecting other traffic.
However, at some point the queue backs up to the point where it blocks all traffic on the bridge and therefore causes all traffic on the motorway to grind to a halt.
The reason this is happening is that the bridge has reached its maximum capacity and the flow of traffic is deadlocked by the rate-limited offramp beyond it.
This is similar to the situation in unicorn, where a surge of traffic can leave all unicorn capacity blocked by limited downstream flow, such as a slow NFS mount or a rate limit on an extremely busy repository.
Workaround
Currently, Gitaly will cancel a request when the client goes away. In the case of unicorn, this might be after a 60-second timeout.
I propose that we set up a configurable threshold timeout (with a default of 30 seconds, for example); let's use
GITALY_CALL_TIMEOUT_SECONDS = 30
From this value, we could derive further timeouts:
FAST_GITALY_CALL_TIMEOUT = GITALY_CALL_TIMEOUT_SECONDS / 10 # Defaults to 3 seconds
MEDIUM_GITALY_CALL_TIMEOUT = GITALY_CALL_TIMEOUT_SECONDS / 2 # Defaults to 15 seconds
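Since the threshold should be configurable, here is a minimal sketch of how these constants might be derived, assuming the base value is read from the environment (the variable name comes from this proposal, not existing code):

```ruby
# Read the base threshold from the environment, falling back to the
# proposed default of 30 seconds.
GITALY_CALL_TIMEOUT_SECONDS = Integer(ENV.fetch('GITALY_CALL_TIMEOUT_SECONDS', 30))

# Derived tiers, as proposed above.
FAST_GITALY_CALL_TIMEOUT   = GITALY_CALL_TIMEOUT_SECONDS / 10 # 3 seconds by default
MEDIUM_GITALY_CALL_TIMEOUT = GITALY_CALL_TIMEOUT_SECONDS / 2  # 15 seconds by default
```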
Then, in the Gitaly client stubs in GitLab Rails, we set up default (but possibly per-call overridable) timeout values for each method call. For the majority of calls, we would use MEDIUM_GITALY_CALL_TIMEOUT. For fast calls, such as CommitIsAncestor or RefExists, we would use FAST_GITALY_CALL_TIMEOUT (3 seconds maximum call time, including queuing). For slow calls, such as CommitLanguages, we could use the full GITALY_CALL_TIMEOUT_SECONDS.
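As a sketch of what those per-call defaults could look like on the client side (the table and helper are hypothetical; the RPC names are the gitaly-proto methods mentioned above):

```ruby
# Hypothetical per-RPC timeout table; anything not listed falls back to the
# medium tier via the Hash default.
GITALY_RPC_TIMEOUTS = Hash.new(MEDIUM_GITALY_CALL_TIMEOUT).merge(
  commit_is_ancestor: FAST_GITALY_CALL_TIMEOUT,    # expected to be fast: 3s
  ref_exists:         FAST_GITALY_CALL_TIMEOUT,    # expected to be fast: 3s
  commit_languages:   GITALY_CALL_TIMEOUT_SECONDS  # known to be slow: 30s
)

# Resolve the deadline for a call, allowing a per-call override.
def gitaly_deadline(rpc, override: nil)
  Time.now + (override || GITALY_RPC_TIMEOUTS[rpc])
end
```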
Minimum Viable Change
I propose that as a first step, we set up a maximum call time, configured to 30 seconds, on all Gitaly calls in GitLab Rails.
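A minimal sketch of that first step, assuming we build the stubs ourselves (the address is a placeholder, the service name follows the gitaly-proto Ruby bindings, and `timeout:` is grpc-ruby's stub-level option that bounds every call made through the stub):

```ruby
require 'grpc'
require 'gitaly' # gitaly-proto gem

# One blanket 30-second deadline on every call made through this stub,
# including time spent queuing on the Gitaly side.
stub = Gitaly::CommitService::Stub.new(
  'gitaly.internal:9999',     # placeholder address
  :this_channel_is_insecure,
  timeout: GITALY_CALL_TIMEOUT_SECONDS
)
```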
The next step would be to lower some of these timeouts for calls that we expect to be fast.
We can monitor the effect of this change and then work on improving the rate limiting as per #650.
Follow-on Work
Down the line, we'll need to do more to handle these timeout errors gracefully (possibly by masking the git section of the UI), but this can come later.
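As a rough sketch of what that could look like, assuming a Rails controller concern (the module name and messaging are invented for illustration; GRPC::DeadlineExceeded is the error grpc-ruby raises when a deadline is missed):

```ruby
require 'grpc'

# Hypothetical controller concern: instead of a 500 page, degrade gracefully
# when a Gitaly call misses its deadline.
module MasksGitalyTimeouts
  extend ActiveSupport::Concern

  included do
    rescue_from GRPC::DeadlineExceeded do |_error|
      flash.now[:alert] = 'The repository is taking too long to respond.'
      render :show # render the page with the git-backed sections masked
    end
  end
end
```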