This has two side effects. By not using clone, each and every ref is stored unpacked first. Fetching missing objects then takes much longer, as some operations require checking files on disk, and it looks like we may have an algorithm with polynomial complexity when refs aren't packed.
Proposal
Initial Geo sync will have to use `git clone --mirror`. We need to investigate whether we can still use this authentication approach with clone:
```ruby
# Fetch the repository, using a JWT header for authentication
authorization = ::Gitlab::Geo::RepoSyncRequest.new.authorization
header = { "http.#{url}.extraHeader" => "Authorization: #{authorization}" }
```
Also, it's been a long time since I wrote the first synchronization code; we need to review the whole repository-is-empty/not-empty state machine and how it behaves with Geo cloning instead of creating a new repo first.
The goal here is to be able to import gitlab-ce in about 3 minutes using `git clone`.
I had a quick discussion about the differences between `init; fetch` and `clone` in the #git channel on freenode. Outcomes:
The `--prune` argument to `git fetch` causes us to do a scan of refs that doesn't have to happen in the clone case. It's unclear how expensive this is when there are no refs on the local side.
We could use `git ls-remote geo > packed-refs` to pre-populate the refs after `init`, but before `fetch`. This creates a "broken" repository, in that there are references pointing to objects we don't have; the subsequent `fetch` then acts much more like it does in the clone case.
The output format of ls-remote isn't quite correct - peeled tags are in slightly the wrong format. We could munge the output to be correct, or exclude peeled refs using --refs. I don't think the latter would have any negative data consequences, since the git fetch would do the peeling for us, but it might have negative performance implications. I don't know if the clone creates an initial packed-refs with peeled or unpeeled output.
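The steps above could be sketched as follows (the `geo` remote name and URL are placeholders; `--refs` drops HEAD and the peeled `^{}` entries, and `tr` converts ls-remote's tab separator into the space that `packed-refs` expects):

```shell
# Pre-populate packed-refs after init, before the first fetch (sketch).
git init --bare repo.git
cd repo.git
git remote add geo http://primary/namespace/repo.git

# ls-remote prints "<sha>\t<refname>"; packed-refs wants "<sha> <refname>".
git ls-remote --refs geo | tr '\t' ' ' > packed-refs

# The repository is now "broken" (refs point at objects we don't have)
# until the fetch brings the objects in.
git fetch geo '+refs/*:refs/*'
```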
Hmm, the ls-remote > packed-refs; fetch approach gives us more refs than the clone approach. refs/merge-requests seems to be missing if we just do a clone, for instance.
@nick.thomas the missing ones must be because you forgot `--mirror`? Or are we really missing things here?
Also, on the other issue, I suggested we could filter out some refs that Geo doesn't need, like `refs/tmp/*` and `refs/remotes`.
It looks like if we remove the `--prune`, it behaves like the clone (trying to confirm that). If this is true, I think we should make prune a separate step, as it looks like the prune algorithm is the one generating non-packed refs.
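If that holds, splitting the prune out could be as simple as the following sketch (the `geo` remote name is a placeholder):

```shell
# Initial sync: no --prune, so it should behave like clone.
git fetch geo '+refs/*:refs/*'

# Later, as a separate maintenance step, drop refs deleted upstream.
git fetch --prune geo '+refs/*:refs/*'
```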
gitlab-com/migration#270 (closed) suggests that switching to git clone won't help in at least some cases, unfortunately. On the server side, building the packfile simply takes too long, and using git clone doesn't speed it up to the point where it works in our environment.
(I think we could solve the authentication problem by setting GIT_CONFIG before shelling out to git clone, incidentally, but I don't think we should do this)
Ok I think we are talking about two different things.
The project that failed to clone here: gitlab-com/migration#270 (closed), failed because the source couldn't generate everything it needed or because the target machine (the one running clone) exhausted resources?
If the latter, then it would also break when trying to pack the repository, which means we are screwed anyway.
What I want to fix in this issue is the "repository takes too long to run `git log` because we have unpacked refs" problem.
As a bonus, which may also help in your situation, the second proposal should reduce the amount of unneeded garbage we replicate.
You'll see that in the first case, you always get a packed-refs file, whereas in the second case you do not.
This means that, regardless of repository size, it's always better to initiate with a `git clone`. It won't solve the issue with large repositories, but it will get us into a better state.
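A quick way to see the difference between the two cases (the URL is a placeholder):

```shell
url=http://primary/root/gitlab-ce.git

# Case 1: clone always leaves a packed-refs file behind.
git clone --mirror "$url" clone-test.git
ls clone-test.git/packed-refs

# Case 2: init + fetch leaves one loose file per ref instead.
git init --bare fetch-test.git
git -C fetch-test.git fetch "$url" '+refs/*:refs/*'
find fetch-test.git/refs -type f | wc -l   # many loose ref files, no packed-refs
```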
What do we do with our existing repositories on GPRD? Rather than redownloading them all, it seems easier to just run a `git gc --no-prune` to optimize the existing refs.
I think I'd prefer to handle our existing dataset with a one-time maintenance task.
For packing refs in the future, I'd prefer a patch that adds a `git pack-refs --all` step to the existing `git fetch`-based sequence. This will be faster and lower-impact than rearchitecting Geo, Gitlab::Shell, and Gitaly to add a clone RPC.
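A minimal sketch of that patch, assuming the existing `geo` remote and mirror refspec:

```shell
# Existing sequence: fetch everything (leaves loose refs behind)...
git fetch geo '+refs/*:refs/*' --prune

# ...then pack them, so ref enumeration no longer hits one file per ref.
git pack-refs --all
```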
What about hooking into our HousekeepingService, which manages the git gc on the push side of things? We should perhaps hook into the pull side of things in the same way.
These values come from configuration and it's interesting to note that if the periods are shared, a full repack will never be run.
I don't think we'd be able to reuse the housekeeping service directly, since we'd need to keep track of "pulls since last run" in the project registry, rather than the project. The service is quite a thin wrapper around GitGarbageCollectWorker.perform_async(@project.id, :full_repack...) anyway, so we don't lose much by duplicating parts of it for the geo pull case.
In both simulations, we get everything, including refs/merge-requests and refs/keep-around etc., as I'm cloning directly, bypassing GitLab (so I don't have to generate JWT tokens etc.). The refs are also all packed (with fetch, we have to pack them as a separate step).
Thanks, I figured it would be silly of the client to unpack the pack file that was just sent by the server. So this issue is really about having loose refs, which is a much less CPU and I/O intensive problem to deal with than repacking pack files.
So I've been trying some alternatives, and it looks like `GIT_CONFIG` can't be used with `git clone` (it doesn't read configuration from there).
The only possible solution is to use `--template`:
```
$ GIT_CONFIG=/tmp/geo/config git clone http://primary/root/gitlab-ce.git
Cloning into 'gitlab-ce'...
Username for 'http://primary': ^C

$ git clone http://primary/root/gitlab-ce.git --template=/tmp/geo
Cloning into 'gitlab-ce'...
```
That means we will have to provide one template folder per new project, with a config file that looks like this:
```
# the URL can also be just "http://primary" for simplicity
[http "http://primary/namespace/repo.git"]
	extraHeader = "Authorization: GL-Geo ryWvJi0AwD9K0S_eRpZq:eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJkYXRhIjoie30iLCJqdGkiOiI4MTk0ZThkYy1mZmRlLTQxY2QtYmYwMC1lMjQyZWQ5N2M5MTgiLCJpYXQiOjE1MjI4NzgzNDgsIm5iZiI6MTUyMjg3ODM0MywiZXhwIjoxNTIyODc4OTQ4fQ.beUedeUL_OmKSpDd7XHGV1OmsJTyI1L4gSiuWKCHP-o"
```
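Putting it together, creating the per-project template and cloning through it might look like this sketch (paths, URL, and token are placeholders):

```shell
# Build a one-off template directory holding only the config file.
mkdir -p /tmp/geo-template
cat > /tmp/geo-template/config <<'EOF'
[http "http://primary/namespace/repo.git"]
	extraHeader = "Authorization: GL-Geo <token>"
EOF

# clone copies the template's config into the new repository, so the
# extra header is in place for the initial transfer.
git clone --mirror --template=/tmp/geo-template \
    http://primary/namespace/repo.git repo.git
```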
@brodock After your final comment here, there is reference to an MR that is merged. Does that mean this issue can be closed, or is there still work to be done here?
@rnienaber this is still relevant. The solution in !5261 (merged) is to continue "cloning" the same way we do, but repack afterwards. This issue is about cloning in a way that doesn't require a repack afterwards, so it should be slightly faster and use fewer resources (as the repack requires I/O and CPU).
Here is the copy/paste of the benchmark code and results:
@zj-gitlab
Here is what I'm using to benchmark and simulate this transfer (this simulates the same thing we use in Geo... it also includes the hidden refs, so it's a slightly different type of clone):
```
Initialized empty Git repository in /root/benchmark-geo-gitlab-ce/gitlab-ce-fetch/.git/

real    2m57.445s
user    2m28.876s
sys     0m28.080s
```
The main difference between the two is that with fetch, we get tons of keep-around refs unpacked on the filesystem (you can imagine how badly this behaves with NFS). Here is a summary of the differences between the two folders:
Given we'll soon have core delta islands, I fully expect the timings to go down much further. For me these numbers tell me 'orders of magnitude' slowdowns aren't true, nor is there a need for a git clone RPC.
@zj-gitlab I'm happy to take another shot with the current GitLab repository... in the case of Geo, if we can cut ~30 seconds from a single clone (on a non-busy machine), that's a huge win for us. I don't see how delta islands would help here in any way.
(The "orders of magnitude" figure was from before we added the GC after the fetch. That fixed the major issue; the remaining, minor one corresponds to the ~30 seconds above in this case.)
> don't see how delta islands would help here in any way.
Core delta islands will pack the majority of data in a single extra pack on the server, the client will mostly get that pack and some new data that was missing since the pack got created.
I see... so the problem is not on the server but on the client side. The way git handles fetch versus clone is actually different (this is not well documented)... when we first saw the issue, it took some tracing and reading of the source code to see that they behave differently.
What you are saying is: a Geo primary running with delta islands will serve a pack that a Geo secondary running a regular `git fetch` will receive and store as-is, not requiring the GC at the end. Is that the case?
What we've observed so far is that when this operation happens via fetch, instead of storing the pack as-is, git stores it unpacked on disk, whereas with clone no extra packing is required and it is stored as we expect.
In Gitaly there is something very close to what we need here: CreateRepositoryFromURL, which uses a clone approach. We should investigate whether the current code is enough or whether we need some extra flags.
What I believe is missing:
--mirror
Authorization defaults to `authHeader := fmt.Sprintf("Authorization: Basic %s", base64.StdEncoding.EncodeToString([]byte(creds)))`; we don't want the `Basic` scheme for Geo, as we are using JWT.
I will create an issue in the gitaly issue board to discuss the implementation details.
I've just re-validated this today, and it is still broken. When we do the initial "clone" with fetch, git first creates one file for each ref, and these later need to be packed in a separate step.
This puts huge pressure on the disk and slows Geo down significantly.
@nhxnguyen we should consider prioritizing this, the fix is simple and it can improve all of our metrics.
A quick summary of what I've observed running the synchronization of that repository:
It takes ~930 seconds (about 15 and a half minutes) to sync the repository using fetch
It takes an additional ~480 seconds (about 8 minutes) to run housekeeping on that same repository
The idea here is that we can remove the housekeeping entirely on the first execution and cut the total time from about 23 minutes and 30 seconds to "just" around 16 minutes.
@nhxnguyen Yes, we would need to have support on Gitaly to switch between implementations when "cloning" for the first time or "fetching/updating" after.
When this was proposed, the change was relatively simple and we could have done it ourselves (a matter of prioritizing). As Gitaly is now a much more complex application, taking this on ourselves would require coordinating with the Gitaly team to make sure we don't break Praefect, deduplication, etc.
Proposal: investigate whether the clone command supports all the configuration we ship today through environment variables, make a PoC using it with Geo, and see what else we need to change in Praefect to make it work with the new command.
The risk we can identify early on is that some env variables may only work with fetch and not with clone; in that case, it would become a blocker here.
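For what it's worth, newer git (2.31+) can inject configuration into any command, including clone, via the `GIT_CONFIG_COUNT`/`GIT_CONFIG_KEY_n`/`GIT_CONFIG_VALUE_n` environment variables; whether that covers everything we ship today is exactly what the PoC would need to verify. A sketch (URL and token are placeholders):

```shell
# Inject the auth header through the environment instead of GIT_CONFIG
# (which clone ignores) or a --template directory.
GIT_CONFIG_COUNT=1 \
GIT_CONFIG_KEY_0='http.http://primary.extraHeader' \
GIT_CONFIG_VALUE_0='Authorization: GL-Geo <token>' \
git clone --mirror http://primary/namespace/repo.git repo.git
```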
I've identified the changes needed and I'm working on a demonstration (this will likely need a sync with the Gitaly team to make sure Praefect works as intended, but the way it works should nonetheless be compatible).
I've validated most of the changes and they provide improvements. In a normal scenario it is 21-25% faster, uses less I/O, creates far fewer objects on disk, and likely uses less CPU/memory overall (I haven't measured that yet).
On a stress test using GitLab's own unpacked repository snapshot, the difference was enormous (385% faster), though that test did not replicate the hidden refs in the clone implementation (I still think it will perform better, simply because it won't have to create 600K additional files on disk representing every unpacked object or reference).
Investigating that worst case later, I found that when unpacked the repository consumed around 9GB, which would be the case when first syncing it using fetch. After GC that went down to 2.3GB, which is what I expect to get by using a regular clone in the first place.
All of that together means this change will have a significant impact when dealing with the bigger projects, while still providing gains in both clock time and resource usage even for the smaller repositories.
@mkozono -- @brodock will be on PTO next week and, following that, working on Clickhouse. Given that all MRs here are merged, I think this is done and I am closing this issue.
The next item is #357462 (closed) and I don't think @brodock will be able to pick this up. I'll move it to the board and we should discuss who can finish this piece of work in the next scheduling call.