Privileged endpoint to get the current (possibly inconsistent) state of a repository
This is in progress. There are many moving parts, so I'm documenting them here.
The approach we're taking is to add two Gitaly RPCs: `GetSnapshot` and `CreateRepositoryFromSnapshot`.
In GitLab, we add an API endpoint: `/api/v4/projects/:id/raw_archive`. This uses workhorse to trigger the `GetRawArchive` RPC, downloading a `.tar` (note: not compressed) archive of the specified bare repository:
- gitlab-workhorse!248 (merged)
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/18327 (update workhorse MR)
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/18173 (ready)
In CE, this will only be available to instance administrators, and it won't be used in any automated way or linked to from the UI. Handy for backups and GDPR requests, perhaps.
In EE, we'll add Geo JWT authentication to the API endpoint, and a method that will call the `CreateRepositoryFromRawArchive` RPC in a worker, pointing it at the API endpoint with a JWT. This will be triggerable from either the Rails console or a UI element.
- gitlab-ee: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/5313 (ready)
All clear?
This feature missed the 7th so I opened an exception request: gitlab-org/release/tasks#162 (closed)
Here's a diagram of the various connections we're making once everything has been strung together:

```
gprd sidekiq:         grpc client (CreateRepositoryFromSnapshot) ->
gprd gitaly:          grpc server ->
gprd gitaly:          https client ->
gitlab.com workhorse: https client ->
gitlab.com unicorn:   https server (GET /api/v4/projects/:id/snapshot HTTP authentication request) ->
  OK:
    gitlab.com workhorse: grpc client (GetSnapshot) ->
    gitlab.com gitaly:    grpc server ->
      Generate and stream a tar archive containing a whitelist of files from the specified repository
  NOT OK:
    401 unauthorized
OK:
  Unpack the tar archive to disk
```
Every TCP connection is secured with TLS.
The grpc client <-> grpc server steps are secured using the pre-existing Gitaly authentication mechanisms (I seem to recall this is a shared token of some kind).
The gprd gitaly <-> gitlab.com workhorse (inter-cluster, gprd http client talking to gitlab.com API) step is secured using the pre-existing Geo authentication mechanism (generating time-limited JWTs), or the pre-existing admin/auditor auth mechanisms (private tokens, etc.).
The 'smart' git HTTP fetch protocol can be CPU-intensive on initial clone. To reduce load in some scenarios (such as Geo initial replication), it seems to make sense to allow the client to:
- Download a (possibly inconsistent) snapshot of the repository using an API endpoint
- Run `git fetch`, `git fsck`, and `git gc` in the downloaded snapshot to make it consistent
This moves much of the CPU usage from the server side to the client side, speeding up the total process significantly.
To this end, we need to add a privileged endpoint in gitlab-ce which defers to workhorse + gitaly to actually get the bytes. These can be served as a simple tar file containing the existing packfiles, possibly any loose objects, and possibly the refs db + config.
Note: packfiles are already compressed, so there's no need to compress the archive again.
If we take a snapshot of the repository while a write is in progress, there's a chance that the snapshot will be inconsistent in some form. This is an acceptable outcome, and doesn't defeat the purpose of the endpoint.
We could, in theory, provide a "guarantee consistency" flag. If set, we'd attempt to gain an exclusive write lock on the repository (competing with `git push` and configuration updates) before requesting the snapshot. I don't think we need it for the first iteration, though.
Original proposal
Currently, GitLab only supports the 'smart' git HTTP fetch protocol. This consists of running `git info-refs` and `git upload-pack` processes in gitaly.
The smart protocol has many advantages over the dumb protocol, but it has one major downside - it's much more CPU-intensive. For operations where you're starting with no repository on the client side at all, you can be looking at an extended period of high-CPU processing to generate a packfile on the fly that can be sent to the client.
This is an issue for two operations internally in GitLab: Geo repository backfill, and repository forking.
Rather than running `git update-server-info` on `git push`, we could implement the two required endpoints dynamically: `foo.git/info/refs` and `foo.git/objects/info/packs`. From there, I believe the client just needs to be able to download the packfiles in `foo.git/objects/pack/` that are referenced in `objects/info/packs`, but this needs validating. It may need access to other files as well.
On the client side, we may need some way to force git to use the dumb transfer protocol for these special operations.
Thanks to @jacobvosmaer-gitlab for coming up with the initial idea of downloading all the packfiles to initialize the repository!