gRPC can become a bottleneck for Gitaly when serving Git fetch traffic
In &372 we added a cache to Gitaly to remove a bottleneck that can occur when serving Git fetch traffic, typically to CI fleets. When there are many fetches at once, the Gitaly server becomes CPU-saturated generating the Git fetch response data. This is a problem because, when it happens, all RPC calls to that server slow down.
The cache appeared to remove the "generating the data" bottleneck, but when we exposed it to the intended workload (the CI fetches generated by gitlab-org/gitlab Merge Requests) the Gitaly server still became CPU-saturated, resulting in unacceptable slowdowns. At the end of our investigation into those slowdowns in #1024 (closed), it looked like we were hitting a different bottleneck. After further inspection, it appears to be a bottleneck in gRPC itself.
The status quo is that Git fetch traffic can push Gitaly servers into CPU saturation, even with the pack-objects cache enabled. Experiments in #1041 (closed) suggest that if we can avoid the gRPC bottleneck and use the pack-objects cache, we can unlock 3x more network bandwidth from a single Gitaly server, saturating the network instead of the CPU.
## Background
During &372 we did not think to add Gitaly itself to the "git process metrics" in Grafana until the end of the project. As a result, we unfortunately lack easily queryable historical data for this. We always collected Gitaly CPU usage, but we never had it in the same graph as `git upload-pack` etc. Now that we do, it is clearer that Gitaly itself consumes enough CPU for us to take a closer look.
One thing I noticed is that Gitaly allocates lots of memory when serving Git fetch traffic. I looked into this a bit and concluded this is "just" part of how gRPC works, or more specifically of the `grpc-go` v1.29.0 library that Gitaly uses. Here is an example from the GCP continuous profiler, which uses a 10-second window when measuring heap allocations. The top line shows that during this sample a single Gitaly process allocated 40GB in 10 seconds, and the graph shows it all happens in the Protobuf library and the `grpc-go` HTTP/2 server.
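For anyone who wants to look at the same allocation data locally without the GCP continuous profiler, Go's built-in pprof tooling exposes it. This is a minimal, generic sketch using standard `net/http/pprof`; the port and setup here are my own assumptions, not Gitaly's actual profiler wiring.

```go
// Minimal sketch: expose Go's built-in pprof endpoints so heap allocations
// can be inspected with `go tool pprof`. This is standard Go tooling, not the
// GCP continuous profiler setup referenced above.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Inspect cumulative allocations with:
	//   go tool pprof -sample_index=alloc_space http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```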
gRPC is organized around "messages". From the point of view of Gitaly, it is constantly sending gRPC messages; it is up to `grpc-go` to then send these messages across the network to the client. What stood out to me here is that we spend a lot of CPU time on "gRPC message sending" work, compared to the actual network IO of writing bytes into a socket.
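To make "gRPC message sending" concrete, here is a rough sketch of the pattern a server-streaming RPC uses to ship bulk data. The `sender` interface and `copyToStream` function are illustrative stand-ins for the generated grpc-go stream type; Gitaly's real RPC definitions differ in detail.

```go
package rpcsketch

import "io"

// sender stands in for a generated grpc-go server-streaming stream, whose
// Send method marshals a protobuf message and frames it into HTTP/2 frames.
type sender interface {
	Send(data []byte) error
}

// copyToStream chops pack data into chunks and emits each chunk as a separate
// gRPC message. Every Send goes through protobuf marshalling and the grpc-go
// HTTP/2 transport, which is where the extra CPU time shows up, rather than
// in the raw socket writes themselves.
func copyToStream(r io.Reader, s sender, chunkSize int) error {
	buf := make([]byte, chunkSize)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			if sendErr := s.Send(buf[:n]); sendErr != nil {
				return sendErr
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}
```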
Going back to allocations, it looks like when we transfer Git fetch data with gRPC, we are allocating O(N) memory per request, where N is the number of bytes we need to transfer. That is surprising because copying data should use O(1) memory per request.
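For contrast, copying the same bytes over a plain stream needs only one fixed-size buffer. A minimal sketch (not Gitaly or rpctest code), assuming the pack data is already available as an `io.Reader`:

```go
package rpcsketch

import (
	"io"
	"net"
)

// servePack streams pack data straight into a TCP connection. io.Copy reuses
// a single fixed-size buffer, so memory use per request stays O(1) no matter
// how many bytes are transferred.
func servePack(conn net.Conn, pack io.Reader) error {
	defer conn.Close()
	_, err := io.Copy(conn, pack)
	return err
}
```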
These are things I had in the back of my mind when I tried to understand the slowdowns in #1024 (closed). While investigating that issue, I noticed that when the number of bytes per second transmitted to the network by the Gitaly server went up, the responsiveness (apdex) of the server went down.
## The experiment
I had some difficulty imagining how I could ever test this hypothesis because I can't just sit down and quickly replace gRPC with something else in Gitaly. But I remembered that all I really care about here is Git fetch over HTTP, so I produced a toy Git HTTP front-end to play the role of GitLab Workhorse, and an RPC backend to play the role of Gitaly. My goal here was not to write production code but to have something that broadly gets the same job done using plain TCP sockets instead of gRPC. You can find the result here: rpctest.
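The real code is in the rpctest repository; the sketch below only conveys the general idea under my own assumptions about its shape. The front-end stands in for Workhorse and simply splices bytes between the HTTP client and a raw TCP connection to the backend, with no per-message marshalling on the hot path. The function and the backend address are hypothetical.

```go
package rpcsketch

import (
	"io"
	"net"
	"net/http"
)

// fetchHandler proxies a Git fetch request to a backend listening at addr
// (hypothetical address) over a plain TCP connection.
func fetchHandler(addr string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		backend, err := net.Dial("tcp", addr)
		if err != nil {
			http.Error(w, "backend unreachable", http.StatusBadGateway)
			return
		}
		defer backend.Close()

		// Forward the client's request (wants/haves) to the backend.
		if _, err := io.Copy(backend, r.Body); err != nil {
			return
		}
		if tcp, ok := backend.(*net.TCPConn); ok {
			tcp.CloseWrite() // half-close: tell the backend the request is done
		}

		// Stream the backend's response (the packfile) back to the client.
		w.Header().Set("Content-Type", "application/x-git-upload-pack-result")
		io.Copy(w, backend)
	}
}
```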
On my laptop, the rpctest versions of Gitaly and GitLab Workhorse seemed to use a lot less CPU, so I took the time to run an experiment on more realistic hardware. I created two c2-standard-30 virtual machines in GCP, which is the machine type we use for Gitaly servers on GitLab.com. I installed Omnibus on both machines. I then configured one machine as a GitLab server with `gitaly` disabled, and the other as a Gitaly server with everything except `gitaly` and `prometheus` disabled. The GitLab server communicated with the Gitaly server over a plain (unencrypted) TCP connection.
To generate load I used `gitaly-debug analyze-http-clone` to clone from localhost on the GitLab machine. I ran 30 instances of `gitaly-debug` in a loop, cloning the same repository. I did my experiments with two different repositories that I imported from GitLab.com: `gitlab-org/gitlab` and `gitlab-com/www-gitlab-com`. I let each experiment run for at least 10 minutes before gathering data with Prometheus on the Gitaly server.
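For reference, the load generation amounts to something like the following sketch. The clone URL is a hypothetical placeholder and the exact `gitaly-debug` arguments may differ; in practice a shell loop works just as well.

```go
// Sketch of the load generator: 30 workers, each repeatedly cloning the same
// repository with gitaly-debug analyze-http-clone.
package main

import (
	"log"
	"os/exec"
	"sync"
)

func main() {
	const workers = 30
	url := "http://localhost/gitlab-org/gitlab.git" // placeholder clone URL

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				cmd := exec.Command("gitaly-debug", "analyze-http-clone", url)
				if err := cmd.Run(); err != nil {
					log.Print(err)
				}
			}
		}()
	}
	wg.Wait()
}
```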
For each repository I tested a matrix of 4 (2 × 2) experiments: regular GitLab vs rpctest, and pack-objects cache on vs off. In the "pack-objects cache on" scenario we effectively had a 100% cache hit rate, because the load generator kept requesting the same Git fetch data.
For each experiment I collected three metrics:
- Network egress: the number of bytes per second transmitted out of the network interface of the Gitaly server. Higher is better. The maximum GCP will give us on these machines is 4000 MB/s.
- System load: the Linux system load average divided by the number of CPUs. Lower is better. Values above 1 correlate with the server becoming less responsive.
- CPU utilization %: non-idle CPU time divided by the number of CPUs. Lower is better. The maximum is 100%. On multi-core systems it is possible to saturate the CPU before this number reaches 100% if the workload is not sufficiently parallelized.
| Workload | Network egress (MB/s) | System load | CPU utilization (%) |
|---|---|---|---|
| **gitlab-com/www-gitlab-com** | | | |
| baseline: regular GitLab, no cache | 920 | 1.48 | 83 |
| regular GitLab + pack-objects cache | 1070 | 1.46 | 75 |
| rpctest, no cache | 2240 | 1.89 | 98 |
| rpctest with pack-objects cache | 3890 | 0.57 | 54 |
| **gitlab-org/gitlab** | | | |
| baseline: regular GitLab, no cache | 700 | 1.53 | 88 |
| regular GitLab + pack-objects cache | 1075 | 1.14 | 73 |
| rpctest, no cache | 1180 | 1.29 | 93 |
| rpctest with pack-objects cache | 3910 | 0.65 | 59 |
See #1041 (comment 563479326) for flamegraphs.
## Observations
- In all scenarios except "rpctest with cache", the test workload saturates the CPU of the Gitaly server
- With both regular GitLab and rpctest, enabling the cache improves throughput and reduces CPU load
- In the "rpctest with cache" scenario, we hit the GCP network bandwidth limit with CPU headroom to spare
- The bandwidth multiplier going from "regular GitLab with cache" to "rpctest with cache" is over 3x (1070 → 3890 MB/s for `gitlab-com/www-gitlab-com`, 1075 → 3910 MB/s for `gitlab-org/gitlab`)
## Conclusion
I think we now know that there is indeed a gRPC bottleneck when it comes to high volumes of Git fetch traffic. I can investigate further whether there is something we can practically do about it.
Update: The investigation was performed here: #1041 (closed)