pack objects concurrency limiting
In Gitaly, we have several knobs to prevent a flood of traffic from saturating and bringing down a Gitaly node: [[rate_limiting]], where a rate limit can be defined per repository/RPC, and [[concurrency]], where `max_per_repo`, `max_queue_wait`, and `max_queue_size` can all be adjusted to stem the flow of traffic for a given RPC per repository.
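As a sketch of how these knobs fit together, a `[[concurrency]]` entry in Gitaly's `config.toml` might look like the following (the RPC name and values here are illustrative, not recommendations):

```toml
# Illustrative Gitaly config.toml fragment: cap concurrent
# PostUploadPackWithSidechannel calls per repository, with a bounded queue.
[[concurrency]]
rpc            = "/gitaly.SmartHTTPService/PostUploadPackWithSidechannel"
max_per_repo   = 20     # concurrent in-flight calls allowed per repository
max_queue_size = 100    # callers waiting beyond this are rejected
max_queue_wait = "10s"  # callers queued longer than this are rejected
```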
All of these knobs aim to prevent a flood of clone traffic from overwhelming a Gitaly server and causing cascading errors. However, limiting the `SSHUploadPackWithSidechannel` and `PostUploadPackWithSidechannel` RPCs is only a rough proxy for this; it's not quite precise enough. One reason is that Gitaly has a pack-objects cache: an RPC can have high concurrency, but if it is served pack data from the cache it doesn't create load on the CPUs. In that situation, a concurrency limit would be unnecessarily restrictive and lead to a poor user experience.
The CPU-heavy process we ultimately care about is `pack-objects`, which git spawns to create a packfile that it sends over the network to the client. Limiting the concurrency of spawned `pack-objects` processes is a much more direct knob for preventing incidents caused by traffic.
Recently, we've put logging in place that shows how many concurrent `git-pack-objects` processes get spawned, broken out by `user_id` and by repository.
*(Figure: top concurrency values broken out by user)*
*(Figure: top concurrency values broken out by repository)*
As we can observe, concurrency broken out by user shows much higher spikes than concurrency broken out by repository. This suggests that individual users are responsible for huge spikes of `pack-objects` processes, which is consistent with our experience in production incidents:
- gitlab-com/gl-infra/production#7484 (closed)
- gitlab-com/gl-infra/production#2457 (closed)
- gitlab-com/gl-infra/production#2600 (closed)
In each of these incidents, mitigation involved blocking the user responsible for the traffic. If we can do this automatically, slowing such users down or, past a certain point, returning errors to just that user, then we can protect our fleet without penalizing other users.
We can add concurrency limits on the `pack-objects` processes themselves, keyed per `user_id`. This gives us a direct way to stem a flurry of `pack-objects` processes and protect our Gitaly fleet.
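The per-user limit described above can be sketched as a small counting limiter in Go. This is a minimal illustration, not Gitaly's actual implementation: the type and method names are made up, and a production version would queue waiters with a timeout (as the existing `max_queue_wait` knob does) rather than fail fast.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTooBusy signals that a user is already at their pack-objects cap.
var errTooBusy = errors.New("too many concurrent pack-objects processes for user")

// userLimiter caps concurrent pack-objects invocations per user ID.
// Hypothetical sketch; names do not correspond to real Gitaly code.
type userLimiter struct {
	mu       sync.Mutex
	maxPerID int
	inFlight map[string]int
}

func newUserLimiter(maxPerID int) *userLimiter {
	return &userLimiter{maxPerID: maxPerID, inFlight: make(map[string]int)}
}

// Acquire reserves a slot for userID before spawning pack-objects,
// failing fast if the user is already at the cap.
func (l *userLimiter) Acquire(userID string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[userID] >= l.maxPerID {
		return errTooBusy
	}
	l.inFlight[userID]++
	return nil
}

// Release frees a slot once the pack-objects process has exited.
func (l *userLimiter) Release(userID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[userID] > 0 {
		l.inFlight[userID]--
	}
	if l.inFlight[userID] == 0 {
		delete(l.inFlight, userID)
	}
}

func main() {
	l := newUserLimiter(2)
	fmt.Println(l.Acquire("user-1")) // slot 1 of 2: nil
	fmt.Println(l.Acquire("user-1")) // slot 2 of 2: nil
	fmt.Println(l.Acquire("user-1")) // at cap: error for this user only
	fmt.Println(l.Acquire("user-2")) // other users are unaffected: nil
	l.Release("user-1")
	fmt.Println(l.Acquire("user-1")) // a freed slot can be reacquired: nil
}
```

The key property is the one the incidents above call for: a single abusive user hits `errTooBusy` while every other user's requests proceed untouched.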