# Proposal: packfile cache for Git clone/fetch responses
We propose to add an optional feature to Gitaly where we de-duplicate and cache `git clone` and `git fetch` responses. The cached data would be stored in object storage (Google Cloud Storage) or on local disk, with time-based expiry.
Computing Git clone responses is computationally expensive (CPU, RAM, IO), and while Git can use smart indexes to speed this up, it is wasteful to repeatedly perform the same computation for different users. The best example is a parallelised CI pipeline where e.g. 100 CI runners all want to clone the exact same branch to run tests on, and the Gitaly server has to compute the same response 100 times. This arbitrary recent CI pipeline on gitlab-org/gitlab, for example, has 129 parallel jobs in the test stage that all need to clone the same Git data.
The cache we are proposing would let Gitaly compute the result only once, cache it, and serve all 100 parallel clones from that cached response. This all happens on the server side and requires no opt-in or cooperation from the user or CI.
This cache would work for both Git protocol v0/v1 and v2.
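For de-duplication to work, two requests must map to the same cache entry exactly when they would yield the same packfile. Below is a minimal sketch of what a cache key could look like, assuming the key covers the repository and the exact arguments and stdin that `git-pack-objects` receives; the names and fields are illustrative, not the actual scheme of the proof of concept:

```go
package packcache

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// cacheKey derives a key such that two fetch requests share a cache
// entry only if git-pack-objects would see the same repository,
// arguments and stdin (wants, haves, capabilities). Length-prefixing
// each field avoids ambiguity when fields are concatenated.
func cacheKey(repoPath string, args []string, stdin []byte) string {
	h := sha256.New()
	writeField := func(b []byte) {
		_ = binary.Write(h, binary.BigEndian, uint64(len(b)))
		h.Write(b)
	}
	writeField([]byte(repoPath))
	for _, a := range args {
		writeField([]byte(a))
	}
	writeField(stdin)
	return fmt.Sprintf("%x", h.Sum(nil))
}
```

Because such a key would be derived from the `git-pack-objects` input rather than from the wire protocol, it is independent of the Git protocol version.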
## Benefits
The benefit of this cache would be that we get more headroom on our Gitaly servers. Gitaly servers are large/expensive and at constant risk of becoming bottlenecks depending on what users decide to throw at them. With more headroom we can have more consistent performance (happier users), achieve better utilization (save money), or both.
The flamegraph below was taken on the Gitaly server that hosts gitlab-org/gitlab under normal (non-duress) circumstances. It shows where the CPUs of the server spend their time over a 30 second interval. What we see is that 46% of the time the CPUs are idle, so 54% of the time they are busy. When you break that down further, you see that 25% of total CPU time was spent in `git pack-objects`. That means 25/54 ≈ 46% of the non-idle CPU time went into `git pack-objects`. This is the chunk of time that this cache should bring down.
In production#3161 (comment 462604949) we also have a flamegraph for what it looked like during a recent incident. The proportions are roughly the same: in that case, 57% of non-idle CPU time was spent on `git pack-objects`.
Thanks to precursor work in gitlab-org/gitaly#1657 (closed), we have rough insight into how many Git HTTP clones and fetches are "identical" by more or less the same standards as the cache this issue proposes.
These two graphs suggest that out of Git HTTP clones of 50 MB and up, within a 30-minute window at most 1 in 3 is unique. That is an average, and as such it masks outliers.
We also have this table, which shows the most repeated Git HTTP clone requests in that half hour, broken down by repository. In it we see that the worst offenders repeat the exact same clone of 100 MB 100 times in 30 minutes, and in some cases even 2000 times in one day. This is in line with what we noted above about the degree of parallelism in gitlab-org/gitlab CI.
## Related incidents
## Technical details
There is a working proof of concept, with a video demo, in gitlab-org/gitaly!2832 (closed). Below we give a high level technical overview of how this proof of concept works.
### Request flow
Ordinarily, the request flow for a `git fetch` or `git clone` looks like this:
```mermaid
sequenceDiagram
participant A as Gitaly (PostUploadPack)
participant B as git-upload-pack
participant C as git-pack-objects
A->>B:fetch request
B->>C:pack request
C->>B:packfile data
B->>A:fetch response
```
Instead, with the cache, we get this:
```mermaid
sequenceDiagram
participant A as Gitaly (PostUploadPack)
participant B as git-upload-pack
participant C as gitaly-hooks
participant E as Gitaly (PackObjectsHook)
participant D as git-pack-objects
A->>B:fetch request
B->>C:pack request
C->>E:PackObjectsHook request
E->>D:(cache miss) pack request
D->>E:packfile data
E->>C:PackObjectsHook response
C->>B:packfile data
B->>A:fetch response
```
A few things to note:
- The PackObjectsHook calls use a local Unix socket. This is a pattern that already exists in Gitaly to handle Git hooks during a push (`pre-receive`, `post-receive` etc.).
- An important part of making a cache like this work well is coordination between cache consumers during a cache miss, to make sure we create only one producer (the real `git-pack-objects`), and to make sure we can stream the output of the producer as it is being created rather than waiting for it to be completely stored in the cache (because creating the output can take minutes). Because the cache is local to a Gitaly process, we can use Go concurrency (channels, mutexes etc.) for the coordination, which is simpler than coordinating across process boundaries. See the coordination sketch after this list.
- The jump from `git-upload-pack` to `gitaly-hooks` happens because we would set the `uploadpack.packObjectsHook` config option on the `git-upload-pack` process (see the spawn sketch after this list).
- It can be decided request by request whether or not to use the cache, for example with a feature flag.
- Although the diagrams talk about PostUploadPack, the same thing applies to SSHUploadPack, and in fact cache entries may be hits for both PostUploadPack and SSHUploadPack: the cache is shared between Git SSH and Git HTTP.
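To make the coordination point concrete, here is a minimal sketch in Go of the single-producer idea: the first caller for a key becomes the producer, and everyone else streams from the same growing entry. This is a simplification, not the proof-of-concept code; error handling, expiry, persistence to disk, and efficient change notification (a real implementation would use `sync.Cond` or channels rather than having consumers poll `Chunk`) are all omitted.

```go
package packcache

import "sync"

// entry accumulates the output of one git-pack-objects run. done is
// closed by the producer once the packfile is complete.
type entry struct {
	mu   sync.Mutex
	buf  []byte
	done chan struct{}
}

// Write lets the producer append packfile bytes as they are generated,
// so consumers can start streaming before the entry is complete.
func (e *entry) Write(p []byte) (int, error) {
	e.mu.Lock()
	e.buf = append(e.buf, p...)
	e.mu.Unlock()
	return len(p), nil
}

// Close marks the entry as complete.
func (e *entry) Close() error {
	close(e.done)
	return nil
}

// Chunk returns the bytes written so far beyond offset, and whether the
// entry is complete. Checking done before reading buf guarantees that
// done == true implies data is the final tail.
func (e *entry) Chunk(offset int) (data []byte, done bool) {
	select {
	case <-e.done:
		done = true
	default:
	}
	e.mu.Lock()
	data = e.buf[offset:]
	e.mu.Unlock()
	return data, done
}

// Cache ensures at most one producer runs per key in this process.
type Cache struct {
	mu      sync.Mutex
	entries map[string]*entry
}

func New() *Cache {
	return &Cache{entries: make(map[string]*entry)}
}

// Lookup returns the entry for key and reports whether the caller is the
// producer. Exactly one caller per key sees produce == true; that caller
// must run git-pack-objects, Write its output into the entry, and Close
// it. All other callers only consume via Chunk.
func (c *Cache) Lookup(key string) (e *entry, produce bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.entries[key]; ok {
		return existing, false
	}
	e = &entry{done: make(chan struct{})}
	c.entries[key] = e
	return e, true
}
```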
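The `uploadpack.packObjectsHook` mechanism itself is standard Git: when the option is set, `git-upload-pack` runs the configured command in place of `git pack-objects`, passing the pack-objects arguments along. A sketch of how Gitaly might spawn `git-upload-pack` with the option set (paths and names here are illustrative); `gitaly-hooks` then forwards the request to the PackObjectsHook RPC over the local Unix socket:

```go
package packcache

import (
	"context"
	"os/exec"
)

// spawnUploadPack starts git-upload-pack with the packObjectsHook option
// set on the command line. Git ignores this option in repository-level
// config as a safety measure, so it must come from the command line or
// from a config file Gitaly controls.
func spawnUploadPack(ctx context.Context, gitalyHooksPath, repoPath string) *exec.Cmd {
	return exec.CommandContext(ctx, "git",
		"-c", "uploadpack.packObjectsHook="+gitalyHooksPath,
		"upload-pack", repoPath,
	)
}
```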
### Cache storage
This cache would store rather large entries. A full clone of gitlab-org/gitlab weighs about 1 GB and a partial clone about 90 MB. The large blobs cannot be partially reused: if even one branch changes, the entire blob becomes invalid. This is in the nature of the Git packfile format. Therefore the cache needs:
- The ability to store (write) large blobs quickly
- Either unlimited storage (cloud object storage) or very short retention and active expiry
Besides storage, there is the issue of reliability: the cache backend becomes a point of failure for Git clones.
The proof of concept in gitlab-org/gitaly!2832 (closed) uses object storage. Having thought about it more, I (JV) now think it would be better to use local disk storage. With an external dependency like object storage you have the failure scenario where the Gitaly server is up but object storage is down. With a local disk I think it is much less likely that the disk is down while the server is still up.
So we propose local disk storage.
- The administrator configures a cache directory and an expiry time. It is up to the administrator to make sure the disk has enough space to hold the amount of data that can be created in the expiry time. Typically the cache directory would be on a dedicated filesystem.
- Gitaly regularly walks the cache directory and deletes files older than the expiry time (a minimal sketch of such a cleaner follows this list).
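The cleaner sketch below assumes a dedicated cache directory; the names (`cleanCache`, `runCleaner`) and the tick interval are illustrative, not part of the proposal:

```go
package packcache

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// cleanCache deletes cache files whose modification time is older than
// maxAge. Failures to remove individual files are logged rather than
// returned, so one bad file cannot block expiry of the rest.
func cleanCache(cacheDir string, maxAge time.Duration) error {
	cutoff := time.Now().Add(-maxAge)
	return filepath.Walk(cacheDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() && info.ModTime().Before(cutoff) {
			if err := os.Remove(path); err != nil {
				log.Printf("cache cleanup: remove %s: %v", path, err)
			}
		}
		return nil
	})
}

// runCleaner runs cleanCache periodically, e.g. from a goroutine
// started when Gitaly boots.
func runCleaner(cacheDir string, maxAge time.Duration) {
	for range time.Tick(maxAge / 2) {
		if err := cleanCache(cacheDir, maxAge); err != nil {
			log.Printf("cache cleanup: %v", err)
		}
	}
}
```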