repository: Allow fast-forking with preexisting object pools

Patrick Steinhardt requested to merge pks-create-fork-support-linking into master

Creating forks that deduplicate objects is quite an involved four-step process right now:

1. The client calls CreateObjectPool to create the object pool from
   the source repository. This creates a complete copy of the object
   pool and is expensive.

2. The client calls CreateFork to create the fork from the source
   repository. Again, this creates a complete copy of all objects and
   is expensive.

3. The client calls LinkRepositoryToObjectPool to link the newly
   created fork to the object pool. No object deduplication happens
   at this step.

4. The client calls OptimizeRepository to deduplicate objects in the
   fork repository. This again is an expensive step because we need
   to rewrite the complete packfile.

After the last step we can finally present a fork to the client that has its objects deduplicated against the object pool. The complete process required us to create a total of three copies of all objects: in the source repository, the object pool and the forked repository. We also had to recompute the packfiles three times. It goes without saying that this is an expensive and slow process.
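
To make the cost of the current flow concrete, the following is a minimal client-side sketch of the four RPCs as gRPC calls against Gitaly. It assumes the gitalypb message and field names as they exist today (CreateObjectPoolRequest, CreateForkRequest, LinkRepositoryToObjectPoolRequest, OptimizeRepositoryRequest) and the v15 module path; connection setup and error handling are abbreviated, so treat it as an illustration rather than production code.

    package forkflow

    import (
        "context"
        "fmt"

        "gitlab.com/gitlab-org/gitaly/v15/proto/go/gitalypb"
        "google.golang.org/grpc"
    )

    // forkWithDeduplication performs the four steps described above.
    func forkWithDeduplication(ctx context.Context, conn *grpc.ClientConn, source, pool, fork *gitalypb.Repository) error {
        objectPoolClient := gitalypb.NewObjectPoolServiceClient(conn)
        repositoryClient := gitalypb.NewRepositoryServiceClient(conn)

        // 1. Create the object pool from the source repository. This writes a
        //    complete copy of all objects into the pool.
        if _, err := objectPoolClient.CreateObjectPool(ctx, &gitalypb.CreateObjectPoolRequest{
            ObjectPool: &gitalypb.ObjectPool{Repository: pool},
            Origin:     source,
        }); err != nil {
            return fmt.Errorf("creating object pool: %w", err)
        }

        // 2. Create the fork from the source repository. This writes yet
        //    another complete copy of all objects.
        if _, err := repositoryClient.CreateFork(ctx, &gitalypb.CreateForkRequest{
            Repository:       fork,
            SourceRepository: source,
        }); err != nil {
            return fmt.Errorf("creating fork: %w", err)
        }

        // 3. Link the fork to the object pool. No objects are deduplicated yet.
        if _, err := objectPoolClient.LinkRepositoryToObjectPool(ctx, &gitalypb.LinkRepositoryToObjectPoolRequest{
            ObjectPool: &gitalypb.ObjectPool{Repository: pool},
            Repository: fork,
        }); err != nil {
            return fmt.Errorf("linking fork to object pool: %w", err)
        }

        // 4. Repack the fork so that objects shared with the pool are dropped
        //    from its own object database. This rewrites the whole packfile.
        if _, err := repositoryClient.OptimizeRepository(ctx, &gitalypb.OptimizeRepositoryRequest{
            Repository: fork,
        }); err != nil {
            return fmt.Errorf("optimizing fork: %w", err)
        }

        return nil
    }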

As a first step towards fixing this architecture, we introduce a new mode for CreateFork that collapses steps 2 to 4 into a single step. When the source repository and the forked repository are supposed to live on the same storage, and when an object pool already exists, we can make use of the preexisting object pool via git-clone(1)'s --reference= option. This causes us to link the newly created repository to the object pool immediately at the time of its creation, with multiple consequences:

- The caller doesn't have to call LinkRepositoryToObjectPool anymore
  as the resulting forked repository is linked to the object pool
  already.

- When cloning the repository, Git can already make use of the
  common objects shared between the object pool and the source
  repository. This speeds up the clone.

- Last but not least, because we can reuse the common objects, Git
  also notices that it doesn't have to copy them into the new object
  database, either. Consequently, we don't have to wait until the
  next call to OptimizeRepository to deduplicate objects, as the
  sketch after this list illustrates.
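
To illustrate the last two points, here is a rough sketch of what a --reference clone leaves on disk, assuming local pool.git and pool-member.git repositories as in the benchmarks below: the new repository's objects/info/alternates file points at the pool, which is the same link that LinkRepositoryToObjectPool would otherwise have to establish after the fact, and git-count-objects(1) shows that only objects missing from the pool were copied.

    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "path/filepath"
    )

    func main() {
        // Clone while borrowing objects from the preexisting pool, as the new
        // CreateFork mode does. Objects already reachable via pool.git are
        // neither transferred nor copied into target.git.
        clone := exec.Command("git", "clone", "--bare", "--mirror", "--no-local",
            "--reference=pool.git", "pool-member.git", "target.git")
        clone.Stdout, clone.Stderr = os.Stdout, os.Stderr
        if err := clone.Run(); err != nil {
            panic(err)
        }

        // The clone is linked to the pool via the alternates file, which is
        // the same mechanism LinkRepositoryToObjectPool sets up after the fact.
        alternates, err := os.ReadFile(filepath.Join("target.git", "objects", "info", "alternates"))
        if err != nil {
            panic(err)
        }
        fmt.Printf("target.git borrows objects from: %s", alternates)

        // Only objects that are missing from the pool end up in the fork's own
        // object database, so there is nothing left for OptimizeRepository to
        // deduplicate later.
        count := exec.Command("git", "-C", "target.git", "count-objects", "-v")
        count.Stdout, count.Stderr = os.Stdout, os.Stderr
        if err := count.Run(); err != nil {
            panic(err)
        }
    }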

Benchmarking this with a mirror clone of gitlab-com/www-gitlab-com shows that the speedup is significant:

Benchmark 1: git clone --bare --mirror --no-local pool-member.git target.git
  Time (abs ≡):        145.160 s               [User: 247.901 s, System: 34.532 s]

Benchmark 2: git clone --bare --mirror --local pool-member.git target.git
  Time (abs ≡):         5.735 s               [User: 2.621 s, System: 3.470 s]

Benchmark 3: git clone --bare --mirror --no-local --reference=pool.git pool-member.git target.git
  Time (abs ≡):        10.733 s               [User: 7.408 s, System: 3.815 s]

Benchmark 4: git clone --bare --mirror --local --reference=pool.git pool-member.git target.git
  Time (abs ≡):         5.768 s               [User: 2.615 s, System: 3.491 s]

The old way of creating the fork is equivalent to benchmark 1, where we use neither --reference nor a --local clone, and takes over two minutes to complete. The new method we implement here is equivalent to benchmark 3, which uses --reference but again without --local. It is roughly 13.5 times as fast as the previous method (145.160 s / 10.733 s ≈ 13.5).

Ideally, we would even be able to use --local, which is roughly twice as fast as the method we implement here. But this does not work in a clustered world because we must make sure that the clone connects to a Gitaly node that has an up-to-date copy of the source repository. We thus perform the clone via SSH, which means that we cannot use local clones.

Further note that we do not enable the new mode unconditionally, but instead do so via a newly introduced Protobuf field. Creating forks this way has a few additional restrictions that are not present with the old way of creating forks: we must ensure that the source repository, object pool and forked repository all exist on the same node. And last but not least, the decision whether to link against an object pool is ultimately policy-driven and should thus be controlled by the client.
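
For comparison with the sketch of the old flow above, the collapsed flow reduces to a single CreateFork call. The actual name of the newly introduced opt-in field is defined in this MR's proto changes and is not spelled out in this description, so it is deliberately left out of the sketch; the snippet assumes it lives in the same file as the earlier sketch and reuses its imports.

    // forkAgainstExistingPool shows the collapsed flow enabled by this change.
    func forkAgainstExistingPool(ctx context.Context, conn *grpc.ClientConn, source, fork *gitalypb.Repository) error {
        repositoryClient := gitalypb.NewRepositoryServiceClient(conn)

        // One call replaces steps 2 to 4: Gitaly clones the fork with
        // --reference= pointing at the preexisting object pool, so the result
        // comes back already linked and already deduplicated. The newly
        // introduced opt-in Protobuf field would be set on this request; the
        // source repository, object pool and fork must all live on the same
        // storage for this to work.
        _, err := repositoryClient.CreateFork(ctx, &gitalypb.CreateForkRequest{
            Repository:       fork,
            SourceRepository: source,
        })
        return err
    }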

Closes #3377.
