This has two side effects. By not using clone, each and every ref is stored unpacked first. Fetching missing objects then takes much longer, as some operations require checking files on disk, and it looks like we may have an algorithm with polynomial complexity when refs aren't packed.
Proposal
Initial Geo sync will have to use `git clone --mirror`. We need to investigate whether we can still use this authentication approach with clone:
```ruby
# Fetch the repository, using a JWT header for authentication
authorization = ::Gitlab::Geo::RepoSyncRequest.new.authorization
header = { "http.#{url}.extraHeader" => "Authorization: #{authorization}" }
```
Also, it's been a long time since I wrote the first synchronization code; we need to review the whole repository-is-empty/not-empty state machine and how it behaves with Geo cloning instead of creating a new repo first.
The goal here is to be able to import gitlab-ce in about 3 minutes using `git clone`.
I had a quick discussion about the differences between `init; fetch` and `clone` in the #git channel on freenode. Outcomes:
The `--prune` argument to `git fetch` causes us to do a scan of refs that doesn't have to happen in the clone case. It's unclear how expensive this is when there are no refs on the local side.
We could use `git ls-remote geo > packed-refs` to pre-populate the refs after `init`, but before `fetch`. This creates a "broken" repository, in that there are references pointing to objects we don't have; the subsequent `fetch` then acts much more like it does in the clone case.
The output format of ls-remote isn't quite correct - peeled tags are in slightly the wrong format. We could munge the output to be correct, or exclude peeled refs using --refs. I don't think the latter would have any negative data consequences, since the git fetch would do the peeling for us, but it might have negative performance implications. I don't know if the clone creates an initial packed-refs with peeled or unpeeled output.
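The steps above could be sketched as follows (the `geo` remote name and URL are placeholders; `--refs` drops HEAD and the peeled `^{}` entries, and `tr` converts ls-remote's tab separator into the space that `packed-refs` expects):

```shell
# Pre-populate packed-refs after init, before the first fetch (sketch).
git init --bare repo.git
cd repo.git
git remote add geo http://primary/namespace/repo.git

# ls-remote prints "<sha>\t<refname>"; packed-refs wants "<sha> <refname>".
git ls-remote --refs geo | tr '\t' ' ' > packed-refs

# The repository is now "broken" (refs point at objects we don't have)
# until the fetch brings the objects in.
git fetch geo '+refs/*:refs/*'
```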
Hmm, the ls-remote > packed-refs; fetch approach gives us more refs than the clone approach. refs/merge-requests seems to be missing if we just do a clone, for instance.
@nick.thomas the missing ones must be because you forgot `--mirror`? Or are we really missing things here?
Also, on the other issue, I suggested we could filter out some refs that Geo doesn't need, like `refs/tmp/*` and `refs/remotes`.
It looks like if we remove the `--prune`, it behaves like the clone (trying to confirm that). If this is true, I think we should make prune a separate step, as it looks like the prune algorithm is the one generating non-packed refs.
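If that holds, splitting the prune out could be as simple as the following sketch (the `geo` remote name is a placeholder):

```shell
# Initial sync: no --prune, so it should behave like clone.
git fetch geo '+refs/*:refs/*'

# Later, as a separate maintenance step, drop refs deleted upstream.
git fetch --prune geo '+refs/*:refs/*'
```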
gitlab-com/migration#270 (closed) suggests that switching to git clone won't help in at least some cases, unfortunately. On the server side, building the packfile simply takes too long, and using git clone doesn't speed it up to the point where it works in our environment.
(I think we could solve the authentication problem by setting GIT_CONFIG before shelling out to git clone, incidentally, but I don't think we should do this)
Ok I think we are talking about two different things.
The project that failed to clone here: gitlab-com/migration#270 (closed), failed because the source couldn't generate everything it needed or because the target machine (the one running clone) exhausted resources?
If the latter, then it would also break when trying to pack the repository, which means we are screwed anyway.
What I want to fix in this issue is the "repository takes too long to run `git log` because we have unpacked refs" problem.
As a bonus, which may also help in your situation, the second proposal should reduce the amount of unneeded garbage we replicate.
You'll see that in the first case, you always get a packed-refs file, whereas in the second case you do not.
This means that, regardless of repository size, it's always better to initiate with a `git clone`. It won't solve the issue with large repositories, but it will get us into a better state.
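A quick way to see the difference between the two cases (the URL is a placeholder):

```shell
url=http://primary/root/gitlab-ce.git

# Case 1: clone always leaves a packed-refs file behind.
git clone --mirror "$url" clone-test.git
ls clone-test.git/packed-refs

# Case 2: init + fetch leaves one loose file per ref instead.
git init --bare fetch-test.git
git -C fetch-test.git fetch "$url" '+refs/*:refs/*'
find fetch-test.git/refs -type f | wc -l   # many loose ref files, no packed-refs
```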
What do we do with our existing repositories on GPRD? Rather than redownloading them all, it seems easier to just run a `git gc --no-prune` to optimize the existing refs.
I think I'd prefer to handle our existing dataset with a one-time maintenance task.
For packing refs in the future, I'd prefer a patch that adds a `git pack-refs --all` step to the existing `git fetch`-based sequence. This will be faster and lower-impact than rearchitecting Geo, Gitlab::Shell, and Gitaly to add a clone RPC.
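A minimal sketch of that patch, assuming the existing `geo` remote and mirror refspec:

```shell
# Existing sequence: fetch everything (leaves loose refs behind)...
git fetch geo '+refs/*:refs/*' --prune

# ...then pack them, so ref enumeration no longer hits one file per ref.
git pack-refs --all
```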
What about hooking into our HousekeepingService, which manages the git gc on the push side of things? We should perhaps hook into the pull side of things in the same way.
These values come from configuration and it's interesting to note that if the periods are shared, a full repack will never be run.
I don't think we'd be able to reuse the housekeeping service directly, since we'd need to keep track of "pulls since last run" in the project registry, rather than the project. The service is quite a thin wrapper around GitGarbageCollectWorker.perform_async(@project.id, :full_repack...) anyway, so we don't lose much by duplicating parts of it for the geo pull case.
In both simulations, we get everything, including refs/merge-requests and refs/keep-around etc., as I'm cloning directly, bypassing GitLab (so I don't have to generate JWT tokens etc.). The refs are also all packed (with fetch, we have to pack them as a separate step).
Thanks, I figured it would be silly of the client to unpack the pack file that was just sent by the server. So this issue is really about having loose refs, which is a much less CPU and I/O intensive problem to deal with than repacking pack files.
So I've been trying some alternatives, and it looks like `GIT_CONFIG` can't be used with `git clone` (it doesn't read configuration from there).
The only possible solution is to use `--template`:
```
$ GIT_CONFIG=/tmp/geo/config git clone http://primary/root/gitlab-ce.git
Cloning into 'gitlab-ce'...
Username for 'http://primary': ^C

$ git clone http://primary/root/gitlab-ce.git --template=/tmp/geo
Cloning into 'gitlab-ce'...
```
That means we will have to provide one template folder per new project, with a config file that looks like this:
```
# the URL can also be just "http://primary" for simplicity
[http "http://primary/namespace/repo.git"]
	extraHeader = "Authorization: GL-Geo ryWvJi0AwD9K0S_eRpZq:eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJkYXRhIjoie30iLCJqdGkiOiI4MTk0ZThkYy1mZmRlLTQxY2QtYmYwMC1lMjQyZWQ5N2M5MTgiLCJpYXQiOjE1MjI4NzgzNDgsIm5iZiI6MTUyMjg3ODM0MywiZXhwIjoxNTIyODc4OTQ4fQ.beUedeUL_OmKSpDd7XHGV1OmsJTyI1L4gSiuWKCHP-o"
```
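Putting it together, creating the per-project template and cloning through it might look like this sketch (paths, URL, and token are placeholders):

```shell
# Build a one-off template directory holding only the config file.
mkdir -p /tmp/geo-template
cat > /tmp/geo-template/config <<'EOF'
[http "http://primary/namespace/repo.git"]
	extraHeader = "Authorization: GL-Geo <token>"
EOF

# clone copies the template's config into the new repository, so the
# extra header is in place for the initial transfer.
git clone --mirror --template=/tmp/geo-template \
    http://primary/namespace/repo.git repo.git
```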
@brodock After your final comment here, there is reference to an MR that is merged. Does that mean this issue can be closed, or is there still work to be done here?
@rnienaber this is still relevant. The solution in !5261 (merged) is to continue "cloning" the same way we do, but repack afterwards. This issue is about cloning in a way that doesn't require a repack afterwards, so it should be slightly faster and use fewer resources (as the repack requires I/O and CPU).
Here is the copy/paste of the benchmark code and results:
@zj-gitlab
Here is what I'm using to benchmark and simulate this transfer (this simulates the same thing we use in Geo... it also includes the hidden refs, so it's a slightly different type of clone):
```
Initialized empty Git repository in /root/benchmark-geo-gitlab-ce/gitlab-ce-fetch/.git/

real    2m57.445s
user    2m28.876s
sys     0m28.080s
```
The main difference between the two is that with fetch, we get tons of keep-around refs unpacked on the filesystem (you can imagine how badly this behaves with NFS). Here is a summary of the differences between the two folders:
Given we'll soon have core delta islands, I fully expect the timings to go down much further. For me these numbers tell me 'orders of magnitude' slowdowns aren't true, nor is there a need for a git clone RPC.
@zj-gitlab I'm happy to take another shot with the current GitLab repository... in the case of Geo, if we can cut ~30 seconds from a single clone (on a non-busy machine), that's a huge win for us. I don't see how delta islands would help here in any way.
(The "orders of magnitude" figure was from before we added the GC after the fetch. That fixed the major issue; the remaining, minor one corresponds to the ~30 seconds above in this case.)
> don't see how delta islands would help here in any way.
Core delta islands will pack the majority of data in a single extra pack on the server, the client will mostly get that pack and some new data that was missing since the pack got created.
I see... so the problem is not on the server but on the client side. The way git handles fetch versus clone is actually different (this is not well documented)... when we first saw the issue, it took some tracing and reading of the source code to see that they behave differently.
What you are saying is: a Geo primary running with delta islands will serve a pack that a Geo secondary running a regular `git fetch` will receive and store as-is, not requiring the GC at the end. Is that the case?
What we've observed so far is that when this operation happens via fetch, instead of storing the pack as-is, git stores it unpacked on disk, whereas with clone no extra packing is required and it is stored as we expect.
In Gitaly there is something very close to what we need here: CreateRepositoryFromURL, which uses a clone approach. We should investigate whether the current code is enough or whether we need some extra flags.
What I believe is missing:
--mirror
Authorization defaults to `authHeader := fmt.Sprintf("Authorization: Basic %s", base64.StdEncoding.EncodeToString([]byte(creds)))`; we don't want the `Basic` scheme for Geo, as we are using JWT.
I will create an issue in the gitaly issue board to discuss the implementation details.
I've just re-validated this today, and it is still broken. When we do the initial "clone" with fetch, git first creates one file for each ref, and these later need to be packed in a separate step.
This puts huge pressure on the disk and slows Geo down significantly.
@nhxnguyen we should consider prioritizing this, the fix is simple and it can improve all of our metrics.
A quick summary of what I've observed running the synchronization of that repository:
It takes ~930 seconds (about 15 and a half minutes) to sync the repository using fetch
It takes an additional ~480 seconds (about 8 minutes) to run housekeeping on that same repository
The idea here is that we can remove the housekeeping entirely on the first execution and cut the total time from about 23 minutes and 30 seconds to "just" around 16 minutes.
@nhxnguyen Yes, we would need to have support on Gitaly to switch between implementations when "cloning" for the first time or "fetching/updating" after.
When this was proposed, the change was relatively simple and we could have done it ourselves (a matter of prioritizing). As Gitaly is now a much more complex application, taking this on ourselves would require coordinating with the Gitaly team to make sure we don't break Praefect, deduplication, etc.
Proposal: investigate whether the clone command supports all the configuration we ship today through environment variables, make a PoC using it with Geo, and see what else we need to change in Praefect to make it work with the new command.
The risk we can identify early on is that some env variables may only work with fetch and not with clone; in that case, it would become a blocker here.
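For what it's worth, newer git (2.31+) can inject configuration into any command, including clone, via the `GIT_CONFIG_COUNT`/`GIT_CONFIG_KEY_n`/`GIT_CONFIG_VALUE_n` environment variables; whether that covers everything we ship today is exactly what the PoC would need to verify. A sketch (URL and token are placeholders):

```shell
# Inject the auth header through the environment instead of GIT_CONFIG
# (which clone ignores) or a --template directory.
GIT_CONFIG_COUNT=1 \
GIT_CONFIG_KEY_0='http.http://primary.extraHeader' \
GIT_CONFIG_VALUE_0='Authorization: GL-Geo <token>' \
git clone --mirror http://primary/namespace/repo.git repo.git
```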
I've identified the changes needed and I'm working on a demonstration (this will likely need a sync with the Gitaly team to make sure Praefect works as intended, but the way it works should nonetheless be compatible).
I've validated most of the changes and they provide improvements. In a normal scenario it is 21-25% faster, uses less I/O, creates far fewer objects on disk, and likely uses less CPU/memory overall (I haven't measured that yet).
On a stress test using GitLab's own unpacked repository snapshot, the difference was enormous (385% faster), though that test did not replicate the hidden refs in the clone implementation (I still think it will perform better, simply because it won't have to create 600K additional files on disk representing every unpacked object or reference).
Investigating that worst case later, I found that when unpacked the repository consumed around 9GB, which would be the case when first syncing it using fetch. After GC that went down to 2.3GB, which is what I expect to get by using a regular clone in the first place.
All of that together means this change will have a significant impact when dealing with the bigger projects, while still providing gains in both clock time and resource usage even for the smaller repositories.
@mkozono -- @brodock will be on PTO next week and, following that, working on Clickhouse. Given that all MRs here are merged, I think this is done and I am closing this issue.
The next item is #357462 (closed) and I don't think @brodock will be able to pick this up. I'll move it to the board and we should discuss who can finish this piece of work in the next scheduling call.