Add Per-Repo Throttling to Gitaly
- Server Implementation: !376 (merged)
- ~"Acceptance Testing": #618 (closed)
Blocked On
As an outcome of the July 20th outage: https://gitlab.com/gitlab-com/infrastructure/issues/2314
First, watch https://youtu.be/f7ecUqHxD7o?t=8m37s
Add rate limiting, probably through a gRPC middleware (which can rely on the repoPath set in the context by the upstream gRPC middleware), to limit the number of concurrent requests per repo.
Any requests beyond that limit would be queued until other requests for that repo have completed.
Also, log the amount of time a repository is throttled, using structured logging and Prometheus metrics, and add alerts on these metrics so that we are aware when a repository is being throttled.
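The queueing behaviour described above could be sketched roughly as follows. This is only an illustration using the Go standard library, not the actual middleware: `RepoLimiter`, `Do`, `peakConcurrency`, and the `/repos/a.git` path are hypothetical names, and a real version would call `Do` from inside a gRPC interceptor with the repoPath taken from the request context.

```go
package main

import (
	"context"
	"sync"
	"sync/atomic"
	"time"
)

// RepoLimiter caps the number of concurrently running requests per repository.
// Hypothetical sketch; maxPerRepo is assumed to be greater than zero.
type RepoLimiter struct {
	maxPerRepo int
	mu         sync.Mutex
	repos      map[string]*repoEntry
}

type repoEntry struct {
	slots chan struct{} // buffered to maxPerRepo; a successful send takes a slot
	refs  int           // running + waiting requests, so idle entries can be dropped
}

func NewRepoLimiter(maxPerRepo int) *RepoLimiter {
	return &RepoLimiter{maxPerRepo: maxPerRepo, repos: make(map[string]*repoEntry)}
}

// Do waits until a slot for repoPath is free (or ctx is done), then runs fn.
// It returns how long the call spent queued.
func (l *RepoLimiter) Do(ctx context.Context, repoPath string, fn func() error) (time.Duration, error) {
	l.mu.Lock()
	e, ok := l.repos[repoPath]
	if !ok {
		e = &repoEntry{slots: make(chan struct{}, l.maxPerRepo)}
		l.repos[repoPath] = e
	}
	e.refs++
	l.mu.Unlock()

	// Drop the map entry once nobody is running or waiting on this repo.
	defer func() {
		l.mu.Lock()
		e.refs--
		if e.refs == 0 {
			delete(l.repos, repoPath)
		}
		l.mu.Unlock()
	}()

	start := time.Now()
	select {
	case e.slots <- struct{}{}: // slot acquired
	case <-ctx.Done():
		return time.Since(start), ctx.Err()
	}
	waited := time.Since(start)
	defer func() { <-e.slots }() // free the slot for the next queued request

	return waited, fn()
}

// peakConcurrency fires n requests at one repo and reports the highest
// number of handlers observed running at the same time.
func peakConcurrency(maxPerRepo, n int) int32 {
	l := NewRepoLimiter(maxPerRepo)
	var running, peak int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.Do(context.Background(), "/repos/a.git", func() error {
				cur := atomic.AddInt32(&running, 1)
				for {
					p := atomic.LoadInt32(&peak)
					if cur <= p || atomic.CompareAndSwapInt32(&peak, p, cur) {
						break
					}
				}
				time.Sleep(5 * time.Millisecond) // simulated work
				atomic.AddInt32(&running, -1)
				return nil
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}
```

Calling `peakConcurrency(2, 8)` should never report more than 2 handlers in flight. A buffered channel per repo keeps the queue logic short, but it does not guarantee strict FIFO ordering among waiters; whether that matters here is an open question.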
In addition to the per-repo throttle, I would suggest that we also add a global throttle for additional resilience.
We now believe that some of the load spikes on gitlab.com's Gitaly servers are due to git clone waves from CI. A simple way to defend against that is to add a throttle that limits the number of concurrent clones on a single repository.
We can build this as a middleware. Here is a possible starting point: https://github.com/yaronsumel/grpc-throttle/blob/master/throttle.go
Suggested requirements:
- configurable via config.toml (see below)
- track the number of waiting requests per RPC in a Prometheus gauge (keyed by RPC, so the total across all repos)
- log the number of milliseconds spent queueing
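The last two requirements could be prototyped along these lines. This is a standard-library sketch only: `QueueStats`, `Enter`, and `Waiting` are hypothetical names, the mutex-guarded map merely stands in for a Prometheus gauge keyed by RPC, and the plain `log.Printf` call stands in for whatever structured logger is used.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// QueueStats counts requests that are currently waiting, keyed by RPC name.
// Hypothetical stand-in for a Prometheus gauge vector with an "rpc" label.
type QueueStats struct {
	mu      sync.Mutex
	waiting map[string]int
}

func NewQueueStats() *QueueStats {
	return &QueueStats{waiting: make(map[string]int)}
}

// Enter records that a request started queueing. The returned function must
// be called once when the request leaves the queue; it decrements the count,
// logs the time spent queueing, and returns that time in milliseconds.
func (s *QueueStats) Enter(rpc, repoPath string) (leave func() int64) {
	s.mu.Lock()
	s.waiting[rpc]++
	s.mu.Unlock()

	start := time.Now()
	return func() int64 {
		ms := time.Since(start).Milliseconds()
		s.mu.Lock()
		s.waiting[rpc]--
		s.mu.Unlock()
		log.Printf("rpc=%s repo=%s queue_ms=%d", rpc, repoPath, ms)
		return ms
	}
}

// Waiting returns the number of requests currently queued for rpc,
// summed across all repositories.
func (s *QueueStats) Waiting(rpc string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.waiting[rpc]
}
```

A throttle would call `Enter` just before it starts waiting for a slot and call the returned function as soon as the slot is acquired, so the gauge reflects only queued requests and not running ones.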
Config example:
[[throttle]]
rpc = "/gitaly.SmartHTTP/PostUploadPack"
max_per_repo = 10
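On the Go side, the config above might map onto something like the following. This is a sketch under assumptions: the `ThrottleConfig` type, its field names, and the `LimitsByRPC` helper are hypothetical, and the `toml` struct tags assume a TOML decoder that honours them. Actual decoding of config.toml is left out so the example stays within the standard library.

```go
package main

import "fmt"

// ThrottleConfig mirrors one [[throttle]] table from config.toml.
// Type and field names are hypothetical.
type ThrottleConfig struct {
	RPC        string `toml:"rpc"`
	MaxPerRepo int    `toml:"max_per_repo"`
}

// LimitsByRPC validates the decoded entries and indexes the per-repo limit
// by full RPC method name, which is what a middleware would look up.
func LimitsByRPC(entries []ThrottleConfig) (map[string]int, error) {
	limits := make(map[string]int, len(entries))
	for _, e := range entries {
		if e.RPC == "" {
			return nil, fmt.Errorf("throttle entry is missing rpc")
		}
		if e.MaxPerRepo <= 0 {
			return nil, fmt.Errorf("throttle %q: max_per_repo must be positive, got %d", e.RPC, e.MaxPerRepo)
		}
		if _, dup := limits[e.RPC]; dup {
			return nil, fmt.Errorf("throttle %q is configured more than once", e.RPC)
		}
		limits[e.RPC] = e.MaxPerRepo
	}
	return limits, nil
}
```

RPCs without an entry would simply be absent from the map, which a middleware could treat as "not throttled".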