pack objects concurrency limiting
In Gitaly, we have several knobs to prevent a flood of traffic from saturating and bringing down a Gitaly node: [[rate_limiting]], where a rate limit can be defined per repository/RPC, and [[concurrency]], where `max_per_repo`, `max_queue_wait`, and `max_queue_size` can all be adjusted to stem the flow of traffic for a given RPC per repository.
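As a sketch of how these knobs fit together, a `[[concurrency]]` entry in Gitaly's `config.toml` might look like the following (the RPC name and values here are illustrative, not recommendations):

```toml
# Illustrative Gitaly config.toml fragment: cap concurrent
# PostUploadPackWithSidechannel calls per repository, with a bounded queue.
[[concurrency]]
rpc            = "/gitaly.SmartHTTPService/PostUploadPackWithSidechannel"
max_per_repo   = 20     # concurrent in-flight calls allowed per repository
max_queue_size = 100    # callers waiting beyond this are rejected
max_queue_wait = "10s"  # callers queued longer than this are rejected
```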
All of these knobs aim to prevent a flood of clone traffic from overwhelming a Gitaly server and causing cascading errors. However, limiting the `SSHUploadPackWithSidechannel` and `PostUploadPackWithSidechannel` RPCs is only a rough proxy for this; it's not quite precise enough. One reason is that Gitaly has a pack-objects cache: an RPC can have high concurrency, but if it is served pack data from the cache it doesn't create load on the CPUs. In that situation, a concurrency limit would be unnecessarily restrictive and lead to a poor user experience.
The CPU-heavy process we ultimately care about is `pack-objects`, which git spawns to create a packfile that it sends over the network to the client. Limiting the concurrency of spawned `pack-objects` processes is a much more direct knob for preventing incidents caused by traffic.
Recently, we've put logging in place that shows how many concurrent `git-pack-objects` processes get spawned, broken out by `user_id` and by repository.
*(Figure: top concurrency values broken out by user)*
*(Figure: top concurrency values broken out by repository)*
As we can observe, concurrency broken out by user shows much higher spikes than concurrency broken out by repository. This suggests that individual users are responsible for huge spikes of `pack-objects` processes, which is consistent with our experience in production incidents:
- gitlab-com/gl-infra/production#7484 (closed)
- gitlab-com/gl-infra/production#2457 (closed)
- gitlab-com/gl-infra/production#2600 (closed)
In each of these incidents, mitigation involved blocking the user responsible for the traffic. If we can do this automatically, slowing such users down or, past a certain point, returning errors to just that user, then we can protect our fleet without penalizing other users.
We can add concurrency limits on the `pack-objects` processes themselves, keyed per `user_id`. This gives us a direct way to stem a flurry of `pack-objects` processes and protect our Gitaly fleet.
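The per-user limit described above can be sketched as a small counting limiter in Go. This is a minimal illustration, not Gitaly's actual implementation: the type and method names are made up, and a production version would queue waiters with a timeout (as the existing `max_queue_wait` knob does) rather than fail fast.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTooBusy signals that a user is already at their pack-objects cap.
var errTooBusy = errors.New("too many concurrent pack-objects processes for user")

// userLimiter caps concurrent pack-objects invocations per user ID.
// Hypothetical sketch; names do not correspond to real Gitaly code.
type userLimiter struct {
	mu       sync.Mutex
	maxPerID int
	inFlight map[string]int
}

func newUserLimiter(maxPerID int) *userLimiter {
	return &userLimiter{maxPerID: maxPerID, inFlight: make(map[string]int)}
}

// Acquire reserves a slot for userID before spawning pack-objects,
// failing fast if the user is already at the cap.
func (l *userLimiter) Acquire(userID string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[userID] >= l.maxPerID {
		return errTooBusy
	}
	l.inFlight[userID]++
	return nil
}

// Release frees a slot once the pack-objects process has exited.
func (l *userLimiter) Release(userID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[userID] > 0 {
		l.inFlight[userID]--
	}
	if l.inFlight[userID] == 0 {
		delete(l.inFlight, userID)
	}
}

func main() {
	l := newUserLimiter(2)
	fmt.Println(l.Acquire("user-1")) // slot 1 of 2: nil
	fmt.Println(l.Acquire("user-1")) // slot 2 of 2: nil
	fmt.Println(l.Acquire("user-1")) // at cap: error for this user only
	fmt.Println(l.Acquire("user-2")) // other users are unaffected: nil
	l.Release("user-1")
	fmt.Println(l.Acquire("user-1")) // a freed slot can be reacquired: nil
}
```

The key property is the one the incidents above call for: a single abusive user hits `errTooBusy` while every other user's requests proceed untouched.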