Proposal: Offloading large blobs
Monorepos grow unwieldy for any number of reasons; large blobs are one of them. Git has no native way to handle large blobs well. git-lfs can be used, but it requires setup on both the client and the server. Having a way to offload large blobs to object storage would go a long way toward improving the performance of repositories with large files.
Proposal 1: Clear server of blobs periodically
Note: this idea comes from @chriscool
On a git push:
```mermaid
graph TD
  client[Git client] -->|push| S(Git server)
  S --> U(server update hook)
  U -->|http post| OS[Object Storage]
```
Let's say a git push contains an 800 MB blob. We can have a server hook that detects this and uploads the blob to object storage.
However, this is only part of the story, because the 800 MB blob still gets pushed to, and stored on, the server. We also need a way to remove it from the server.
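The detection step can be sketched in shell. This is a hypothetical update-hook fragment, not Gitaly's actual implementation; the `OBJECT_STORAGE_URL` variable and the upload endpoint layout are assumptions for illustration:

```shell
#!/bin/sh
# Sketch of a server-side update hook that finds oversized blobs among
# newly pushed objects and uploads them to object storage.

LIMIT=$((800 * 1024 * 1024)) # 800 MB in bytes

# Read `git cat-file --batch-check` output ("<oid> <type> <size>") and
# print the object IDs of blobs larger than $LIMIT.
large_blobs() {
  while read -r oid type size; do
    if [ "$type" = "blob" ] && [ "$size" -gt "$LIMIT" ]; then
      printf '%s\n' "$oid"
    fi
  done
}

# In the real hook, $3 is the new ref value being pushed; the pipeline
# would look something like (OBJECT_STORAGE_URL is an assumed config):
#   git rev-list --objects "$3" --not --all | cut -d' ' -f1 |
#     git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize)' |
#     large_blobs |
#     while read -r oid; do
#       git cat-file blob "$oid" |
#         curl -sf -X PUT --data-binary @- "$OBJECT_STORAGE_URL/$oid"
#     done
```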
Currently, Git supports object filters (the same filter spec used by partial clone). We can modify git-repack
to delete objects based on a filter, e.g.:

```
git repack -a -d --filter=blob:limit=800m
```

This way, when housekeeping runs, we shed objects that we have already uploaded to object storage.
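The filter spec is one that `git rev-list` already understands, so the effect of a size-based filter can be tried out locally in a throwaway repository. Sizes here are scaled down (a 1 MB blob, a 1000-byte limit) purely for illustration:

```shell
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.email test@example.com
git -C "$repo" config user.name test

printf 'small' > "$repo/small.txt"            # a tiny blob
head -c 1048576 /dev/zero > "$repo/big.bin"   # a 1 MB blob
git -C "$repo" add .
git -C "$repo" commit -qm 'add files'

# List reachable objects, omitting blobs larger than 1000 bytes:
# small.txt appears in the output, big.bin does not.
git -C "$repo" rev-list --objects --filter=blob:limit=1000 HEAD
```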
Then, on a git fetch:
```mermaid
graph TD
  client[Git client] -->|fetch| S(Git server)
  S --> RH(git remote helper)
  RH --> OS[Object Storage]
```
A Git remote helper can be used to fetch blobs from the object storage.
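A remote helper is an executable named `git-remote-<name>` that Git drives over a line-based protocol on stdin/stdout (see gitremote-helpers(7)). A minimal, hypothetical sketch that advertises the `fetch` capability and pulls objects from object storage (`OBJECT_STORAGE_URL` is an assumed variable, not part of the protocol):

```shell
#!/bin/sh
# Sketch of a remote helper, e.g. installed as git-remote-offload.
remote_helper() {
  while read -r cmd arg1 arg2; do
    case "$cmd" in
      capabilities)
        # Advertise the 'fetch' capability, terminated by a blank line,
        # so Git will ask us to retrieve objects it cannot find.
        printf 'fetch\n\n'
        ;;
      fetch)
        # "fetch <sha1> <name>": download the object from object storage
        # (URL layout is an assumption) and write it into the local
        # object database. A real helper processes fetch lines as a
        # batch and replies with a single blank line at the end.
        curl -sf "$OBJECT_STORAGE_URL/$arg1" |
          git hash-object -w --stdin >/dev/null
        printf '\n'
        ;;
      '')
        break
        ;;
    esac
  done
}
```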
This proposal would require the following work:
- Git upstream changes to git-repack to shed blobs based on a filter.
- Modify Gitaly's update hook to upload blobs to an object storage (this would probably have to talk to Rails to get the object storage information).
- Develop a binary that can serve as the git remote helper that talks to the object storage.