Explore the use of bundle-uri
Summary
This issue explores how we can use Git bundle URIs in Gitaly and GitLab.
Git bundle URIs are locations where Git can download one or more bundles in order to bootstrap the object database in advance of fetching the remaining objects from a remote.
When a client calls git-clone(1) or git-fetch(1), the server can advertise one or more URLs where the client can download bundle(s) from. Those URLs are simply served through HTTP(S).
Opportunities
Having the client download bundles in advance reduces the CPU load on the server: because the client already has part of the repository, the server only needs to build smaller packs for the remaining objects.
This also benefits users, since it should reduce clone latency.
Before we enable the feature, it would be worth recording the CPU/memory usage, so we can then measure the improvements brought about by bundle URIs.
Reduce load from CI
CI jobs often clone the same version of a repository at the same time, yet the server has to serve the content to each of them separately. If they can use a pre-created bundle, the load on the server drops.
Geo-distribute bundles
Because these bundles can be served through HTTPS, they can be put on a CDN, and clients can be routed to the server closest to them.
Challenges
How often/many bundles to publish?
The bundle served by GitLab should be updated on a regular basis. We might want to consider serving at least two bundles:
- An "initial" bundle
- An "incremental" bundle on top of the previous
When users do a git clone, they will download both. But when they have cloned in the past already, they might only need the latter. When the server advertises bundles, we need to make sure the client does not download a huge bundle (i.e. the "initial" bundle) when it already has everything in it. We'll need to tweak which bundles we create and how we advertise them with the relevant bundle.heuristic properties.
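To make the "initial" plus "incremental" idea concrete, here is a minimal sketch of how a bundle list using the creationToken heuristic could be rendered. It assumes the Git config style bundle list format described in Git's bundle URI documentation; the `Bundle` class, the `render_bundle_list` helper, the bundle ids, and the URLs are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bundle:
    bundle_id: str       # hypothetical id, e.g. "initial" or "incremental"
    uri: str             # where the bundle file is served
    creation_token: int  # larger value means a newer bundle


def render_bundle_list(bundles: list[Bundle]) -> str:
    """Render a bundle list in Git config format (creationToken heuristic)."""
    lines = [
        "[bundle]",
        "\tversion = 1",
        "\tmode = all",
        "\theuristic = creationToken",
    ]
    # Emit the oldest bundle first so the file reads chronologically.
    for b in sorted(bundles, key=lambda b: b.creation_token):
        lines += [
            "",
            f'[bundle "{b.bundle_id}"]',
            f"\turi = {b.uri}",
            f"\tcreationToken = {b.creation_token}",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_bundle_list([
        Bundle("initial", "https://cdn.example.com/repo/all.bundle", 1),
        Bundle("incremental", "https://cdn.example.com/repo/incr.bundle", 2),
    ]))
```

As far as we understand the heuristic, a client remembers the highest creationToken it has already unpacked, so a client that cloned earlier should be able to skip the "initial" bundle. This behaviour would need to be verified against the Git version we target.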
Where to store bundle metadata?
When Gitaly creates the bundles and puts them somewhere, e.g. on a CDN, it needs to know where they are in order to advertise them to the client. We could make the bundle URL deterministic so that it can be generated at runtime, something like base_url + repo_hash + /all.bundle.
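A deterministic URL along those lines could be computed as in the sketch below. How repo_hash is derived is not decided here; the sketch assumes a SHA-256 over some stable repository identifier, and the function name and parameters are hypothetical.

```python
import hashlib


def bundle_url(base_url: str, repo_id: str, name: str = "all.bundle") -> str:
    """Build base_url + repo_hash + /<name> without needing stored metadata."""
    # Assumption: repo_hash is the SHA-256 hex digest of a stable repository id.
    repo_hash = hashlib.sha256(repo_id.encode("utf-8")).hexdigest()
    return f"{base_url.rstrip('/')}/{repo_hash}/{name}"


if __name__ == "__main__":
    print(bundle_url("https://cdn.example.com/bundles", "42"))
```

Because the URL depends only on the base URL and the repository identifier, Gitaly would not need to persist any extra state to advertise it. The downside is that anyone who can guess the identifier can guess the URL, which ties into the authorization question.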
Authorization
When a bundle file is published through HTTP(S) we have to be careful this does not make it possible for a malicious user to download that bundle to obtain private data from the repository. Signed URLs in the context of GCP might be an option to look into.
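The general idea behind signed URLs can be sketched with a plain HMAC over the URL and an expiry time. This is a generic illustration, not the GCP signing scheme; the `sign_url` and `verify_url` names, the `expires` and `sig` query parameters, and the shared secret are all assumptions, and the sketch expects the unsigned URL to have no query string of its own.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(url: str, expires_at: int, secret: bytes) -> str:
    # Sign both the URL and the expiry so neither can be changed unnoticed.
    payload = f"{url}\n{expires_at}".encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def sign_url(url: str, secret: bytes, expires_at: int) -> str:
    """Append an expiry timestamp and an HMAC signature to a bundle URL."""
    sig = _signature(url, expires_at, secret)
    return f"{url}?{urlencode({'expires': expires_at, 'sig': sig})}"


def verify_url(signed_url: str, secret: bytes, now: int) -> bool:
    """Return True only if the signature matches and the URL has not expired."""
    parts = urlsplit(signed_url)
    query = parse_qs(parts.query)
    try:
        expires_at = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    expected = _signature(url, expires_at, secret)
    # compare_digest avoids leaking information through timing differences.
    return now < expires_at and hmac.compare_digest(sig.encode(), expected.encode())
```

With something like this, whoever advertises the bundle URI could hand out short-lived URLs only to clients that are already authorized to read the repository, and the server or CDN edge in front of the bundles would reject anything unsigned or expired.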
Enabling the feature
GitLab can advertise bundle URIs, but at the moment Git does not use them by default. The client has to explicitly configure Git to use bundle URIs when they are advertised.
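As an illustration of what opting in looks like on the client side, the sketch below builds the git-clone(1) command line. It relies on the transfer.bundleURI config setting and the --bundle-uri option of git-clone(1); the `clone_command` helper itself is hypothetical, and the minimum Git version needed for each option should be checked before relying on it.

```python
from typing import Optional


def clone_command(repo_url: str, dest: str,
                  bundle_uri: Optional[str] = None) -> list[str]:
    """Build a git-clone(1) argv that makes use of bundle URIs."""
    if bundle_uri is None:
        # Opt in to bundle URIs advertised by the server.
        return ["git", "-c", "transfer.bundleURI=true", "clone", repo_url, dest]
    # Point the client at a known bundle URI explicitly.
    return ["git", "clone", f"--bundle-uri={bundle_uri}", repo_url, dest]


if __name__ == "__main__":
    print(" ".join(clone_command("https://gitlab.example.com/group/project.git",
                                 "project")))
```

The first form depends on the server advertising bundle URIs. The second form works without any advertisement, which might be useful where we control the client invocation ourselves.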
Initial iteration
For the first iteration of using bundle URI we might want to look at CI jobs first. This simplifies a few challenges above:
- Incremental bundles: CI uses git-clone(1) (when configured as such) to fetch the repository, so it doesn't need to worry about incremental bundles.
- Authorization: We have more control over what a CI runner can access, so it might be easier to protect the bundle URI.
- Enabling the feature: Because we control how the runner calls git-clone(1), it's easier to make sure the bundle URIs get used.