Investigate feasibility of burst clone cache
There appears to be some interest in a caching mechanism that would accelerate bursts of identical clone requests. This is supposedly something GitHub does, but a quick Google search did not turn up a source for that claim. Regardless, the idea is worth considering.
Generally speaking, caching the server response of a git clone is expensive because that response is roughly as large as the repository itself. Trading space for time is nice, but storage space is not cheap.
And even setting the storage cost aside, it would be pointless to build a cache that never achieves a good hit/miss ratio.
So if we want to explore the idea of a burst clone cache, I think the first thing that needs to happen is that we start logging metadata about clones, so we can see whether there is any potential for caching at all. That is, we design a cache key and log every response that gets served, keyed by it. If we only ever see unique keys in the log, there is no potential for caching. If there are repeat requests with the same key, the question becomes how they are spaced in time, which raises further questions:
- How long would a cache entry have to stick around for it to be useful for a good number of requests?
- How do we handle concurrent requests with the same cache key? We don't want to populate the cache twice.
- Do we make requests wait until the cache is populated? For how long?
- What if cache population receives back pressure from the network bandwidth of a single Git client?
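To make the cache key idea a bit more concrete, here is a rough sketch in Go of the per-response record we might log, plus one possible key derivation (hashing the repository plus the sorted want/have object IDs from the request). The field names, the log schema, and the key derivation are assumptions for illustration, not a design decision.

```go
// Sketch only: the schema and key derivation below are assumptions, not a final design.
package clonelog

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
	"time"
)

// CloneLogEntry is one record per served clone/fetch response.
type CloneLogEntry struct {
	Repository string    `json:"repository"` // e.g. "group/project.git"
	CacheKey   string    `json:"cache_key"`  // derived from the request, see CacheKey below
	SizeBytes  int64     `json:"size_bytes"` // size of the pack we sent
	ServedAt   time.Time `json:"served_at"`
}

// CacheKey hashes the inputs that determine which objects the response contains:
// the repository plus the sorted want/have object IDs from the request. Repeated
// keys in the log would indicate requests asking for the same objects, i.e.
// caching potential.
func CacheKey(repo string, wants, haves []string) string {
	// Copy before sorting so we don't mutate the caller's slices.
	ws := append([]string(nil), wants...)
	hs := append([]string(nil), haves...)
	sort.Strings(ws)
	sort.Strings(hs)

	sum := sha256.New()
	sum.Write([]byte(repo))
	sum.Write([]byte("\x00wants\x00" + strings.Join(ws, ",")))
	sum.Write([]byte("\x00haves\x00" + strings.Join(hs, ",")))
	return hex.EncodeToString(sum.Sum(nil))
}
```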
So, lots of questions. I think the sane way to approach this is to start logging metadata about clone responses. That alone requires some technical effort. The way I would do it is:
- install a global pack-objects hook
- the hook records metadata about its inputs and proxies to the real `git pack-objects` (see the sketch after this list)
- the metadata gets logged (to ELK)
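For illustration, here is a rough sketch in Go of what such a hook wrapper could look like. It assumes the hook is wired up via Git's `uploadpack.packObjectsHook` config, which invokes the hook with the original `git pack-objects ...` command line appended as arguments; the log path and the `GL_REPOSITORY` environment variable are placeholders, not something the server necessarily sets.

```go
// Sketch of a pack-objects hook wrapper: log metadata, then proxy to the
// real git pack-objects. Paths and env vars are placeholders.
package main

import (
	"encoding/json"
	"log"
	"os"
	"os/exec"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("expected the original pack-objects command as arguments")
	}

	// Record metadata about the inputs before doing any real work. Shipping
	// this to ELK (e.g. a local JSON-lines file picked up by a log shipper)
	// is left out of the sketch.
	entry := map[string]interface{}{
		"time": time.Now().UTC().Format(time.RFC3339),
		"repo": os.Getenv("GL_REPOSITORY"), // placeholder; use whatever the server actually sets
		"args": os.Args[1:],
		"pid":  os.Getpid(),
	}
	if f, err := os.OpenFile("/var/log/git/pack-objects.jsonl", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644); err == nil {
		_ = json.NewEncoder(f).Encode(entry)
		f.Close()
	}

	// Proxy to the real git pack-objects, passing stdin/stdout/stderr through
	// untouched so the client sees exactly what it would have seen without us.
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			os.Exit(exitErr.ExitCode())
		}
		log.Fatal(err)
	}
}
```

The important property is that the wrapper stays invisible to clients: stdin/stdout/stderr are passed through untouched, and a failure to write the metadata should never fail the actual clone.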
Then we deploy this and wait for real-world data.
Do we want to investigate this?