Gitaly: Offload large blobs to secondary storage
Monorepos grow large for a number of reasons, one of which is large blobs. Git has no native way to handle large blobs well. `git-lfs` can be used, but it requires setup on both the client and the server. Offloading large blobs to object storage would go a long way toward improving the performance of repositories with large files.
We can use the following approach to save storage costs by pushing large blobs to secondary storage.
On a git push:
```mermaid
graph TD
client[Git client] -->|push| S(Git server)
S --> U(server update hook)
U -->|http post| OS[Object Storage]
```
Let's say a git push contains an 800 MB blob. A server hook can detect this and upload the blob to object storage.
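The detection step could be sketched as follows. This is a minimal illustration, not Gitaly's implementation: it assumes the hook has already produced lines in the default `git cat-file --batch-check` format (`<object-id> <type> <size>`), and the function name `large_blobs` is hypothetical.

```python
from typing import Iterable, List


def large_blobs(batch_check_lines: Iterable[str], limit_bytes: int) -> List[str]:
    """Return the IDs of blobs whose size is at least limit_bytes.

    Each line is expected in the default `git cat-file --batch-check`
    format: "<object-id> <type> <size>".
    """
    oids = []
    for line in batch_check_lines:
        parts = line.split()
        if len(parts) != 3:
            # Lines such as "<object-id> missing" carry no size; skip them.
            continue
        oid, obj_type, size = parts
        # Only blobs are candidates for offloading; trees and commits stay.
        if obj_type == "blob" and int(size) >= limit_bytes:
            oids.append(oid)
    return oids
```

The hook would then upload each returned object ID to object storage; how the upload is performed is outside this sketch.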
However, this is only part of the story, because the 800 MB blob is still pushed to the server. We also need a way to remove it from the server.
Git already supports partial clone, which allows filtering out large objects when cloning and fetching (see https://docs.gitlab.com/ee/topics/git/partial_clone.html). We can modify `git-repack` to delete objects based on a filter, e.g.:
`git repack -a -d --filter=blob:limit=800m`
This way, when housekeeping runs, we shed the objects that have already been uploaded to object storage.
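The size in a `blob:limit` filter is a byte count with an optional `k`, `m`, or `g` suffix for KiB, MiB, or GiB, so a bare `800` means 800 bytes, while an 800 MB threshold is written `800m`. A small sketch of that conversion (the function name `blob_limit_bytes` is hypothetical, and only the `blob:limit` filter form is handled):

```python
import re

# Unit multipliers for the optional suffix of blob:limit=<n>[kmg].
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def blob_limit_bytes(filter_spec: str) -> int:
    """Convert a "blob:limit=<n>[kmg]" filter spec to a size in bytes."""
    match = re.fullmatch(r"blob:limit=(\d+)([kmg]?)", filter_spec)
    if match is None:
        raise ValueError(f"unsupported filter spec: {filter_spec!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

Git omits blobs whose size is at least the limit, so a blob of exactly the limit size is filtered out as well.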
Then, on a git fetch that needs objects we have uploaded to object storage and shed from the server:
```mermaid
graph TD
client[Git client] -->|fetch| S(Git server)
client -->|fetch| RH(git remote helper)
RH --> OS[Object Storage]
```
A [git remote helper](https://git-scm.com/docs/gitremote-helpers) can be used to get blobs from object storage. On the client, the object storage needs to be configured as a partial clone (promisor) remote that uses the remote helper. When the server cannot provide an object the client needs, the client tries to find it on its partial clone remotes.
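As a rough sketch of that client-side configuration, the commands below register a remote whose URL uses the `<transport>::<address>` form (which makes Git invoke `git-remote-<transport>`) and mark it as a promisor remote. The remote name, the helper name, and the bucket address in the usage note are hypothetical placeholders, not names defined by this proposal.

```python
from typing import List


def promisor_remote_commands(
    name: str, helper: str, address: str, filter_spec: str
) -> List[List[str]]:
    """Build the git commands that set up a partial clone remote.

    `helper` is the suffix of a git-remote-<helper> executable that must be
    on the client's PATH; Git selects it through the "<helper>::" URL prefix.
    """
    url = f"{helper}::{address}"
    return [
        ["git", "remote", "add", name, url],
        # Mark the remote as a promisor so Git may fetch missing objects from it.
        ["git", "config", f"remote.{name}.promisor", "true"],
        # Record which filter this remote is expected to satisfy.
        ["git", "config", f"remote.{name}.partialclonefilter", filter_spec],
    ]
```

For example, `promisor_remote_commands("offload", "gcs", "my-bucket/my-repo", "blob:limit=800m")` yields three argument lists that could each be passed to `subprocess.run` inside the client's repository.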
More specifically, in the case of GitLab, we will have the following flow:
### On a push
```mermaid
graph TD
client[Git client] -->|push| GW(GitLab Workhorse)
GW --> PostReceivePack
subgraph Gitaly
PostReceivePack --> PreReceiveHook
end
PreReceiveHook --> gl(GitLab /allowed)
gl -->|status: 'allowed', filter: 'blob:limit=1g'| PreReceiveHook
PreReceiveHook -->|upload objects larger than 1g| cloud((Cloud Storage))
```
Note: We need to add the ability for GitLab to configure per-project blob offloading filters and to turn offloading on or off.
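One way the hook could interpret the `/allowed` response is sketched below. The `status` and `filter` field names are taken from the diagram above; treating a missing or empty `filter` as "offloading is turned off for this project" is an assumption of this sketch, and `offload_filter` is a hypothetical name.

```python
from typing import Optional


def offload_filter(allowed_response: dict) -> Optional[str]:
    """Return the offloading filter for a push, or None if nothing is offloaded.

    Nothing is offloaded when the push is not allowed, or when the response
    carries no filter (assumed here to mean offloading is disabled).
    """
    if allowed_response.get("status") != "allowed":
        return None
    # An empty string is treated the same as an absent filter.
    return allowed_response.get("filter") or None
```

With the response from the diagram, `offload_filter({"status": "allowed", "filter": "blob:limit=1g"})` returns `"blob:limit=1g"`, which the hook would then use to select the objects to upload.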
### On a fetch
```mermaid
graph TD
client[Git client] -->|1. fetch| GW(GitLab Workhorse)
GW -->|2| PostUploadPack
subgraph Gitaly
PostUploadPack -->|3| git
git -->|4| git-remote-helper
end
git-remote-helper -->|5| cloud((Cloud Storage))
cloud -->|6. stream object| git-remote-helper
git-remote-helper -->|7| git
git -->|8| PostUploadPack
PostUploadPack -->|9| client
```
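The lookup order in steps 3 to 7 can be illustrated with a small sketch: try the local object database first, and fall back to object storage only for objects that have been shed. In the flow above, Git performs this fallback itself through the remote helper; the two callables here are hypothetical stand-ins for those two sources.

```python
from typing import Callable, Optional


def read_blob(
    oid: str,
    read_local: Callable[[str], Optional[bytes]],
    fetch_offloaded: Callable[[str], bytes],
) -> bytes:
    """Return a blob's content, preferring the local object database.

    read_local returns None when the object has been shed from the server;
    fetch_offloaded stands in for the remote helper reading object storage.
    """
    data = read_local(oid)
    if data is not None:
        return data
    # The object is missing locally, so stream it from object storage.
    return fetch_offloaded(oid)
```

Small objects never touch object storage in this scheme, so only fetches that actually need an offloaded blob pay the extra round trip.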
## Issue bird's-eye view
```mermaid
graph TD
#6162["✅ #6162 test instance"]
click #6162 "https://gitlab.com/gitlab-org/gitaly/-/issues/6162" "Create GCP test instance with blob offloading POC"
#4605["✅ #4605 poc rmt helper"]
click #4605 "https://gitlab.com/gitlab-org/gitaly/-/issues/4605" "Add git remote helper that knows how to download blobs from google cloud storage (bucket) remote"
#6257["⏳#6257"]
click #6257 "https://gitlab.com/gitlab-org/gitaly/-/issues/6257" "Discussion: Blob offloading and RAFT and WAL"
#6245["⏳#6245"]
click #6245 "https://gitlab.com/gitlab-org/gitaly/-/issues/6245" "Discussion: Blob offloading and object pool"
#6217["#6217 mgr"]
click #6217 "https://gitlab.com/gitlab-org/gitaly/-/issues/6217" "Add a offloading manager module into Gitaly"
#6216["#6216 ff"]
click #6216 "https://gitlab.com/gitlab-org/gitaly/-/issues/6216" "featureflag: Add a feature flag about blob offloading"
#6191["#6191 pruning"]
click #6191 "https://gitlab.com/gitlab-org/gitaly/-/issues/6191" "Pruning process should consider transaction"
#6120["⏳ #6120"]
click #6120 "https://gitlab.com/gitlab-org/gitaly/-/issues/6120" "Discussion: Blob offloading in object storage with transactions"
#5955["⏳ #5955 blueprint"]
click #5955 "https://gitlab.com/gitlab-org/gitaly/-/issues/5955" "Blueprint on offloading large objects to secondary storage"
#6077["#6077 pass config"]
click #6077 "https://gitlab.com/gitlab-org/gitaly/-/issues/6077" "Pass gc bucket credential properly"
#6268["#6268 GS helper to Gitaly"]
click #6268 "https://gitlab.com/gitlab-org/gitaly/-/issues/6268" "git remote helper: Migrate GS helper to Gitaly codebase"
#6274["#6274 list of obj"]
click #6274 "https://gitlab.com/gitlab-org/gitaly/-/issues/6274" "offloadManager: keep a record of offloaded objects"
#6305["⏳ #6305"]
click #6305 "https://gitlab.com/gitlab-org/gitaly/-/issues/6305" "Discussion: Avoid Git remote helper"
#6306["#6306"]
click #6306 "https://gitlab.com/gitlab-org/gitaly/-/issues/6306" "Discussion: Expose Signing Agent"
#6307["#6307"]
click #6307 "https://gitlab.com/gitlab-org/gitaly/-/issues/6307" "Discussion: Pass filter"
#335["⏳ git#335"]
click #335 "https://gitlab.com/gitlab-org/git/-/issues/335" "cat-file: add --batch-command remote-object-info command"
#294["git#294"]
click #294 "https://gitlab.com/gitlab-org/git/-/issues/294" "git-cat-file: Support for reading object type and content from a remote without downloading them"
#332["⏳ git#332"]
click #332 "https://gitlab.com/gitlab-org/git/-/issues/332" "Allow fetching directly from promisor remote when cloning"
%% #example["⏳ ❌ ✅ #emp"]
%% click issue6136 "https://mylink" "my description"
#6216 --> #6217 --> #6191
#6217 --> #6077
#6217 --> #6274
#6305 --> #6306
#6305 --> #6307
#332 --> #6217
#5955 --> #6257
#5955 --> #6245
#5955 --> #6120
#5955 --> #6305
#335 --> #294
#4605 --> #6268
```