Privileged endpoint to get the current (possibly inconsistent) state of a repository
The 'smart' git HTTP fetch protocol can be CPU-intensive on initial clone. To reduce load in some scenarios (such as Geo initial replication), it seems to make sense to allow the client to:
- Download a (possibly inconsistent) snapshot of the repository using an API endpoint
- Run `git fsck` and `git gc` in the downloaded snapshot to make it consistent
This moves much of the CPU usage from the server to the client, speeding up the total process significantly.
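The client-side steps could look roughly like the sketch below. It assumes the snapshot arrives as an uncompressed tar stream of a bare repository; the function names `extract_snapshot` and `make_consistent` are hypothetical and not part of any existing GitLab API. Only regular files and directories are unpacked, and any entry that would land outside the destination is rejected.

```python
import io
import shutil
import subprocess
import tarfile
from pathlib import Path


def extract_snapshot(tar_stream, dest):
    """Unpack an uncompressed tar snapshot into dest, refusing unsafe members."""
    dest = Path(dest).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    # "r|" reads a plain tar stream sequentially, so it also works on a
    # non-seekable source such as an HTTP response body.
    with tarfile.open(fileobj=tar_stream, mode="r|") as tar:
        for member in tar:
            target = (dest / member.name).resolve()
            # Reject absolute paths and ".." components that escape dest.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in snapshot: {member.name}")
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
            elif member.isfile():
                target.parent.mkdir(parents=True, exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
            # Anything else (symlinks, devices) is skipped in this sketch.
    return dest


def make_consistent(repo_dir, run=subprocess.run):
    """Run `git fsck`, then `git gc`, inside the downloaded snapshot."""
    codes = []
    for subcommand in ("fsck", "gc"):
        # check=False: fsck may report problems in an inconsistent snapshot,
        # and the caller decides how to react to a non-zero exit code.
        result = run(["git", "-C", str(repo_dir), subcommand], check=False)
        codes.append(result.returncode)
    return codes
```

`make_consistent` returns the two exit codes rather than raising, so a caller can fall back to a regular clone if the snapshot turns out to be unusable.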
To this end, we need to add a privileged endpoint in gitlab-ce which defers to workhorse + gitaly to actually get the bytes. These can be served as a simple tar file containing the existing packfiles, possibly any loose objects, and possibly the refs db + config.
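As a rough illustration of the server side, the sketch below streams an uncompressed tar of a bare repository's files. The path list and the name `write_snapshot` are assumptions made for this example; in particular, `HEAD` is included because git expects it in a repository directory, which goes slightly beyond the files listed above. The real implementation would live in gitaly and may look quite different.

```python
import os
import tarfile
from pathlib import Path

# Assumed snapshot contents, relative to the bare repository root.
SNAPSHOT_PATHS = ("HEAD", "config", "packed-refs", "refs", "objects")


def _entries(root):
    """Yield root and, if it is a directory, everything beneath it."""
    if root.is_file():
        yield root
    elif root.is_dir():
        for dirpath, _dirnames, filenames in os.walk(root):
            # Directories are emitted too, so empty ones (for example
            # refs/heads when all refs are packed) survive extraction.
            yield Path(dirpath)
            for filename in filenames:
                yield Path(dirpath) / filename


def write_snapshot(repo_dir, out_stream):
    """Write an uncompressed tar of the repository's files to out_stream."""
    repo_dir = Path(repo_dir)
    # "w|" produces a plain tar stream with no compression.
    with tarfile.open(fileobj=out_stream, mode="w|") as tar:
        for name in SNAPSHOT_PATHS:
            for path in _entries(repo_dir / name):
                arcname = path.relative_to(repo_dir).as_posix()
                try:
                    tar.add(str(path), arcname=arcname, recursive=False)
                except FileNotFoundError:
                    # The repository can change while we read it; a file
                    # that vanished is simply left out of the snapshot.
                    continue
```

Skipping files that disappear mid-walk keeps the stream well-formed even when the repository is being written to, at the cost of a snapshot that may be incomplete.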
Note: packfiles are already compressed, so there's no need to compress the tar file as well.
If we take a snapshot of the repository while a write is in progress, there's a chance that the snapshot will be inconsistent in some form. This is an acceptable outcome, and doesn't defeat the purpose of the endpoint.
We could, in theory, provide a "guarantee consistency" flag. If set, we'd attempt to gain an exclusive write lock on the repository (competing with `git push` and configuration updates) before requesting the snapshot. I don't think we need it for the first iteration though.
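To make the locking idea concrete, here is a minimal sketch of an exclusive lock built on an advisory `flock`. Everything in it is an assumption: the lock file name `snapshot.lock` and the function `exclusive_repo_lock` are hypothetical, and an advisory lock only helps if every writer (pushes, configuration updates) takes the same lock, which git does not do by itself.

```python
import fcntl
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_repo_lock(repo_dir, timeout=5.0, poll_interval=0.05):
    """Hold an exclusive advisory lock on repo_dir for the `with` body."""
    # Hypothetical lock file; cooperating writers must lock the same path.
    lock_path = os.path.join(repo_dir, "snapshot.lock")
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        deadline = time.monotonic() + timeout
        while True:
            try:
                # LOCK_NB makes flock fail immediately instead of blocking,
                # so we can enforce our own timeout.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {repo_dir}")
                time.sleep(poll_interval)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A snapshot request with the flag set would wrap the snapshot call in `with exclusive_repo_lock(repo_dir):`, and fall back to an error (or an unlocked snapshot) on `TimeoutError`.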
Currently, GitLab only supports the 'smart' git HTTP fetch protocol. This consists of running `git info-refs` and `git upload-pack` processes in gitaly.
The smart protocol has many advantages over the dumb protocol, but it has one major downside - it's much more CPU-intensive. For operations where you're starting with no repository on the client side at all, you can be looking at an extended period of high-CPU processing to generate a packfile on the fly that can be sent to the client.
This is an issue for two operations internally in GitLab: Geo repository backfill, and repository forking.
Rather than running `git update-server-info` on `git push`, we could implement the two required endpoints dynamically -
From there, I believe the client just needs to be able to download the packfiles in `foo.git/objects/pack` that are referenced in `objects/info/packs`, but this needs validating. It may need access to other files as well.
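The `objects/info/packs` file is a short text listing with one `P pack-<hash>.pack` line per packfile, followed by a blank line. The sketch below shows how a server could render that listing on the fly instead of relying on `git update-server-info`, and how a client could turn it into download URLs. The function names and the example URL are hypothetical, and whether these files are sufficient is exactly the open question above.

```python
from pathlib import Path


def render_info_packs(repo_dir):
    """Build the objects/info/packs body from the packs currently on disk."""
    pack_dir = Path(repo_dir) / "objects" / "pack"
    names = sorted(p.name for p in pack_dir.glob("pack-*.pack"))
    # One "P <name>" line per pack, terminated by an empty line.
    return "".join(f"P {name}\n" for name in names) + "\n"


def parse_info_packs(text):
    """Return the pack file names listed in an objects/info/packs body."""
    names = []
    for line in text.splitlines():
        if not line.startswith("P "):
            continue
        name = line[2:].strip()
        # Never trust the listing blindly: only plain pack file names.
        if "/" in name or not name.endswith(".pack"):
            raise ValueError(f"unexpected pack name: {name!r}")
        names.append(name)
    return names


def pack_file_urls(repo_url, names):
    """Map pack names to the .idx and .pack URLs under objects/pack."""
    base = repo_url.rstrip("/")
    urls = []
    for name in names:
        stem = name[: -len(".pack")]
        urls.append(f"{base}/objects/pack/{stem}.idx")
        urls.append(f"{base}/objects/pack/{name}")
    return urls
```

For example, `pack_file_urls("https://example.com/foo.git", parse_info_packs(body))` yields the index and pack URLs a client would fetch for each listed pack.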
On the client side, we may need some way to force git to use the dumb transfer protocol for these special operations.
Thanks to @jacobvosmaer-gitlab for coming up with the initial idea of downloading all the packfiles to initialize the repository!