git/housekeeping: Speed up packing loose objects

When we have determined that the packfiles of a repository are in a well-defined state, we inspect whether the repository has too many loose objects as a last heuristic in our new geometric-repacking strategy. If the number of loose objects exceeds a certain threshold, we perform a geometric repack that soaks up all loose objects into a new packfile.
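The loose-object heuristic can be illustrated with a minimal shell sketch. It is not the actual housekeeping code: the function names `count_loose_objects` and `needs_loose_object_repack` and the threshold argument are hypothetical. The sketch only relies on the fact that loose objects live in fan-out directories named after the first two hex digits of the object ID.

```shell
#!/bin/sh

# Hypothetical helper: count the files in the two-hex-digit fan-out
# directories (objects/00 ... objects/ff) of an object directory. This is an
# approximation, as stale temporary files in those directories are counted too.
count_loose_objects() {
    objdir=$1
    count=0
    for f in "$objdir"/[0-9a-f][0-9a-f]/*; do
        # When the glob matches nothing it stays literal, so check for a file.
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Hypothetical helper: succeed if the number of loose objects exceeds the
# given threshold. The threshold is an assumption supplied by the caller.
needs_loose_object_repack() {
    objdir=$1
    threshold=$2
    [ "$(count_loose_objects "$objdir")" -gt "$threshold" ]
}
```

With an illustrative threshold of 1024, `needs_loose_object_repack .git/objects 1024` would succeed only once more than 1024 loose objects have accumulated.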

While this works, the geometric repack also does a number of other things:

- It updates the multi-pack-index to cover the newly generated
  packfile.

- It updates the multi-pack-index bitmap in case we're not connected
  to an alternate object database.

- It potentially collapses multiple packfiles into one in order to
  restore the geometric sequence.

All of this is expensive and scales with the repository size. Ultimately, it feels like a waste of resources because we have already determined that the packfiles are well-optimized. The only thing we care about here is the loose objects, so why do anything other than pack them into a new packfile?

This inefficiency is demonstrated by the following benchmark in the linux.git repository. The benchmark is prepared by writing a single new reachable object into the repository and then executing the command at hand:

Benchmark 1: git repack --geometric=2 -d --write-midx
  Time (mean ± σ):      5.424 s ±  5.148 s    [User: 4.494 s, System: 0.870 s]
  Range (min … max):    3.079 s … 14.633 s    5 runs

Benchmark 2: git repack --geometric=2 -d
  Time (mean ± σ):      2.445 s ±  4.784 s    [User: 2.204 s, System: 0.234 s]
  Range (min … max):    0.296 s … 11.004 s    5 runs

Benchmark 3: git repack -d --write-midx
  Time (mean ± σ):     15.842 s ±  0.044 s    [User: 14.738 s, System: 1.044 s]
  Range (min … max):   15.777 s … 15.886 s    5 runs

Benchmark 4: git repack -d
  Time (mean ± σ):     12.942 s ±  0.049 s    [User: 12.503 s, System: 0.430 s]
  Range (min … max):   12.887 s … 13.002 s    5 runs

Benchmark 5: git pack-objects --pack-loose-unreachable --local --incremental --non-empty </dev/null .git/objects/pack/pack && git prune-packed
  Time (mean ± σ):     174.4 ms ±   5.2 ms    [User: 94.8 ms, System: 73.2 ms]
  Range (min … max):   170.0 ms … 183.2 ms    5 runs

Summary
  'git pack-objects --pack-loose-unreachable --local --incremental --non-empty </dev/null .git/objects/pack/pack && git prune-packed' ran
   14.02 ± 27.43 times faster than 'git repack --geometric=2 -d'
   31.10 ± 29.53 times faster than 'git repack --geometric=2 -d --write-midx'
   74.20 ± 2.21 times faster than 'git repack -d'
   90.83 ± 2.70 times faster than 'git repack -d --write-midx'

There are several observations here:

- The first two benchmarks demonstrate how the geometric repack
  latency fluctuates wildly. This is because it sometimes has to
  restore the geometric sequence, while it doesn't at other times.

- The "incremental" repacks in benchmarks 3 and 4 are extremely
  expensive even though they supposedly only pack the new object
  into a packfile. This is because the incremental repack takes
  reachability into account so that it can skip packing objects
  which aren't referenced. This doesn't have any benefit for us
  though as we use cruft packs anyway to evict unreachable objects.

- Writing the multi-pack-index in benchmarks 1 and 3 causes us to
  add another ~3 seconds.

The last benchmark is equivalent to the new repacking strategy we have introduced in a preceding commit: in contrast to the others, it ignores reachability and never updates multi-pack-indices. Instead, all it does is take all unpacked loose objects and write them into a new packfile. Consequently, it is considerably faster than the other repacking strategies, by a factor of roughly 14 to 91 in the benchmark above.
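As a rough sketch of what this strategy amounts to, the two commands from benchmark 5 can be wrapped in a small shell function. The function name `pack_loose_objects` is hypothetical, and resolving the pack prefix via `git rev-parse --git-path` instead of hardcoding `.git/objects/pack/pack` is an assumption made so that the sketch also covers bare repositories.

```shell
#!/bin/sh

# Hypothetical wrapper around the commands from benchmark 5: write all loose
# objects that are not yet packed into one new packfile, then delete the
# loose copies that have become redundant.
pack_loose_objects() {
    repo=$1
    # Resolve "objects/pack/pack" inside the repository's git directory.
    pack_prefix=$(git -C "$repo" rev-parse --git-path objects/pack/pack) || return 1

    # Stdin is empty, so no revisions are walked and reachability is ignored.
    # --non-empty avoids writing a packfile when there is nothing to pack.
    git -C "$repo" pack-objects --pack-loose-unreachable --local \
        --incremental --non-empty "$pack_prefix" </dev/null &&
        git -C "$repo" prune-packed
}
```

Note that nothing here touches existing packfiles or the multi-pack-index, which is why the cost depends on the number of loose objects rather than on the repository size.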

Convert our housekeeping strategy to use this repack strategy for a nice speedup when all we want to do is optimize loose objects. Note that we don't add a feature flag for this change as this code is only hit when the geometric-repacking feature flag is enabled.

Edited by Patrick Steinhardt
