This has two side effects. By not using clone, each and every ref is stored unpacked first. Fetching the missing objects then takes much longer, as some operations require checking files on disk, and it looks like we may have an algorithm with polynomial complexity when refs aren't packed.
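For context, a quick (illustrative) way to see the symptom on a bare repository that was synced with init + fetch rather than cloned:

```ruby
# Count loose (unpacked) refs vs. entries in packed-refs for a bare repository.
# The path is illustrative; point it at an affected repo.
repo_path = '/var/opt/gitlab/git-data/repositories/group/project.git'

loose_refs  = Dir.glob(File.join(repo_path, 'refs', '**', '*')).count { |f| File.file?(f) }
packed_file = File.join(repo_path, 'packed-refs')
packed_refs = File.exist?(packed_file) ? File.foreach(packed_file).count { |l| !l.start_with?('#', '^') } : 0

puts "loose refs:  #{loose_refs}"
puts "packed refs: #{packed_refs}"
```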
Proposal
Initial Geo sync will have to use git clone --mirror. We need to investigate whether we can still use this authentication approach with clone:
```ruby
# Fetch the repository, using a JWT header for authentication
authorization = ::Gitlab::Geo::RepoSyncRequest.new.authorization
header = { "http.#{url}.extraHeader" => "Authorization: #{authorization}" }
```
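If we do switch to clone, one option (a sketch, not tested) would be to pass the same config key on the command line with git -c, so nothing has to be written to the repository config first; url and repo_path below are illustrative:

```ruby
# Sketch only: pass the same extraHeader config to a one-off clone via `git -c`.
# `url` and `repo_path` are illustrative; error handling is omitted.
authorization = ::Gitlab::Geo::RepoSyncRequest.new.authorization

system('git', '-c', "http.#{url}.extraHeader=Authorization: #{authorization}",
       'clone', '--mirror', url, repo_path)
```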
Also, it's been a long time since I wrote the first synchronization code; we need to review the whole "repository is empty / not empty" state machine and how it behaves with Geo cloning instead of creating a new repo first.
The goal here is to be able to import gitlab-ce in about 3 minutes using git clone.
I had a quick discussion about the differences between init ; fetch and clone in the #git channel on freenode. Outcomes:
The --prune argument to git fetch causes us to do a scan of refs which doesn't have to happen in the clone stage. Unsure how expensive this is when there are no refs on the local side.
We could use git ls-remote geo > packed-refs to pre-populate the refs after init, but before fetch. This creates a "broken" repository in that there are references pointing to objects we don't have; the subsequent fetch then acts much more like it does in the clone case.
The output format of ls-remote isn't quite correct - peeled tags are in slightly the wrong format. We could munge the output to be correct, or exclude peeled refs using --refs. I don't think the latter would have any negative data consequences, since the git fetch would do the peeling for us, but it might have negative performance implications. I don't know if the clone creates an initial packed-refs with peeled or unpeeled output.
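To make that concrete, here is a rough sketch of the init; ls-remote > packed-refs; fetch sequence, shelling out from Ruby. The repository path, URL, and the "geo" remote name are assumptions, and a real implementation would still have to decide what to do about the peeled-tag lines and the packed-refs header:

```ruby
# Rough sketch of init; ls-remote > packed-refs; fetch (not production code).
repo_path = '/var/opt/gitlab/git-data/repositories/group/project.git' # illustrative
url       = 'https://primary.example.com/group/project.git'           # illustrative

system('git', 'init', '--bare', repo_path)

Dir.chdir(repo_path) do
  # "geo" stands in for however the secondary's remote is actually configured.
  system('git', 'remote', 'add', '--mirror=fetch', 'geo', url)

  # --refs skips the peeled "tag^{}" entries, whose format differs from packed-refs.
  entries = IO.popen(%w[git ls-remote --refs geo], &:read).lines

  File.open('packed-refs', 'w') do |f|
    entries.each do |line|
      sha, ref = line.chomp.split("\t", 2)
      f.puts "#{sha} #{ref}" # packed-refs separates with a space, ls-remote with a tab
    end
  end

  # With the refs pre-populated, this fetch behaves much more like the clone case.
  system(*%w[git fetch geo])
end
```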
Hmm, the ls-remote > packed-refs; fetch approach gives us more refs than the clone approach. refs/merge-requests seems to be missing if we just do a clone, for instance.
@nick.thomas the missing ones must be because you forgot --mirror? Or are we really missing things here?
Also, on the other issue, I suggested we could filter out some refs and still be fine for Geo, such as refs/tmp/* and refs/remotes.
It looks like if we remove the --prune it behaves like the clone (trying to confirm that). If that's true, I think we should make prune a separate step, as it looks like the prune algorithm is the one generating unpacked refs.
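One way to read "make prune a separate step" (a sketch, assuming the regular sync keeps using a configured "geo" remote; the path is illustrative):

```ruby
# Sketch: split the current fetch-with-prune into separate steps (illustrative).
repo_path = '/var/opt/gitlab/git-data/repositories/group/project.git'

Dir.chdir(repo_path) do
  system(*%w[git fetch geo])       # no --prune, so it behaves much more like clone
  system(*%w[git pack-refs --all]) # pack whatever loose refs the fetch created

  # Pruning could then run as its own, less frequent step, e.g.:
  #   git fetch --prune geo
end
```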
gitlab-com/migration#270 (closed) suggests that switching to git clone won't help in at least some cases, unfortunately. On the server side, building the packfile simply takes too long, and using git clone doesn't speed it up to the point where it works in our environment.
(I think we could solve the authentication problem by setting GIT_CONFIG before shelling out to git clone, incidentally, but I don't think we should do this)
Ok I think we are talking about two different things.
Did the project that failed to clone in gitlab-com/migration#270 (closed) fail because the source couldn't generate everything it needed, or because the target machine (the one running clone) exhausted its resources?
If the latter, then it would also break when trying to pack the repository, which means we are screwed anyway.
What I want to fix in this issue is the "repository is taking too long to run git log because we have unpacked refs" problem.
As a bonus, which may also help in your situation, the second proposal should reduce the amount of unneeded garbage we replicate.
You'll see that in the first case (clone --mirror) you always get a packed-refs file, whereas in the second case (init + fetch) you do not.
This means that regardless of repository size, it's always better to initiate with a git clone. It won't help us solve the issue with large repositories, but it will get us into a better state.
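For reference, a minimal way to reproduce that comparison, assuming the two cases are git clone --mirror versus git init --bare followed by a fetch (URL and paths are illustrative):

```ruby
# Minimal reproduction sketch: clone --mirror vs. init --bare + fetch, then
# check which one ends up with a packed-refs file. URL and paths illustrative.
url = 'https://gitlab.com/gitlab-org/gitlab-ce.git'

system('git', 'clone', '--mirror', url, '/tmp/case1.git')

system('git', 'init', '--bare', '/tmp/case2.git')
Dir.chdir('/tmp/case2.git') { system('git', 'fetch', url, '+refs/*:refs/*') }

%w[/tmp/case1.git /tmp/case2.git].each do |repo|
  state = File.exist?(File.join(repo, 'packed-refs')) ? 'present' : 'absent'
  puts "#{repo}: packed-refs #{state}"
end
```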
What do we do with our existing repositories on GPRD? Rather than redownload them all, it seems like it's easier just to run a git gc --no-prune to optimize the existing refs.
I think I'd prefer to handle our existing dataset with a one-time maintenance task.
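That one-time task could be as simple as walking the secondary's storage path and running the gc there; a sketch (the storage path is illustrative, and in practice this would probably be a rake task driven by the Geo project registry):

```ruby
# One-off maintenance sketch: gc existing repositories on a secondary to pack
# their refs. The storage path is illustrative; error handling is omitted.
storage_path = '/var/opt/gitlab/git-data/repositories'

Dir.glob(File.join(storage_path, '**', '*.git')).each do |repo|
  puts "Packing refs in #{repo}"
  system('git', 'gc', '--no-prune', chdir: repo)
end
```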
For packing refs in the future, I'd prefer a patch that adds a git pack-refs --all command to the existing git fetch-based sequence. That will be faster and lower-impact than rearchitecting geo, Gitlab::Shell and gitaly to add a clone RPC.
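In practice that would mean something roughly like the following at the end of the existing fetch step (a sketch only; the class and method names are hypothetical, not the actual Geo code):

```ruby
# Hypothetical sketch of where pack-refs could slot into the fetch-based sync;
# the class and method names are illustrative, not the real Geo code.
class GeoRepositorySyncStep
  def fetch_repository(repo_path)
    system(*%w[git fetch --prune geo], chdir: repo_path)

    # New step: pack the loose refs left behind by the fetch, so later commands
    # (git log, upload-pack, ...) don't have to scan thousands of loose files.
    system(*%w[git pack-refs --all], chdir: repo_path)
  end
end
```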
What about hooking into our HousekeepingService, which manages the git gc on the push side of things? We should perhaps hook into the pull side of things in the same way.
These values come from configuration, and it's interesting to note that if the periods share a value, a full repack will never be run.
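To illustrate that point, here is a simplified sketch of the period-selection logic (not the exact service code): when the gc and full-repack periods are equal, the gc branch always matches first, so the full-repack branch is never reached.

```ruby
# Simplified sketch of period-based task selection; not the actual
# HousekeepingService code. The default values shown are illustrative.
def housekeeping_task(pushes_since_gc, gc_period: 200, full_repack_period: 50,
                      incremental_repack_period: 10)
  if pushes_since_gc % gc_period == 0
    :gc
  elsif pushes_since_gc % full_repack_period == 0
    :full_repack
  elsif pushes_since_gc % incremental_repack_period == 0
    :incremental_repack
  end
end

# If the periods share a value, e.g. gc_period == full_repack_period == 200,
# every count that would trigger :full_repack already matches the :gc branch.
housekeeping_task(200, gc_period: 200, full_repack_period: 200) # => :gc
```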
I don't think we'd be able to reuse the housekeeping service directly, since we'd need to keep track of "pulls since last run" in the project registry, rather than the project. The service is quite a thin wrapper around GitGarbageCollectWorker.perform_async(@project.id, :full_repack...) anyway, so we don't lose much by duplicating parts of it for the geo pull case.
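A duplicated, pull-side version might look roughly like this. None of these names are the real API; registry stands for the Geo project registry row and is assumed to carry a fetches_since_gc counter:

```ruby
# Hypothetical sketch of pull-side housekeeping for Geo secondaries. The
# registry object and its fetches_since_gc column are assumptions, not real.
module Geo
  class RepositoryHousekeeping
    FULL_REPACK_PERIOD = 50 # illustrative; would come from configuration

    def initialize(registry)
      @registry = registry
    end

    # Called after every successful Geo fetch for the project.
    def increment!
      @registry.increment!(:fetches_since_gc)
    end

    def execute
      return unless (@registry.fetches_since_gc % FULL_REPACK_PERIOD).zero?

      GitGarbageCollectWorker.perform_async(@registry.project_id, :full_repack)
      @registry.update!(fetches_since_gc: 0)
    end
  end
end
```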