git/housekeeping: Speed up packing loose objects
Once we have determined that the packfiles of a repository are in a well-defined state, the last heuristic in our new geometric-repacking strategy checks whether the repository has too many loose objects. If the number of loose objects exceeds a certain threshold, we perform a geometric repack that soaks up all loose objects into a new packfile.
While this heuristic works, the geometric repack does more than just pack loose objects:
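The loose-object check can be approximated from the command line. This is only a rough sketch and not the actual housekeeping code: the function name `needs_loose_repack` and the threshold of 1024 are hypothetical, and `git count-objects -v` is used merely as a convenient way to count loose objects.

```shell
#!/bin/sh
# Hypothetical threshold; the value used by the real heuristic is not
# stated in this commit message.
LOOSE_OBJECT_THRESHOLD=${LOOSE_OBJECT_THRESHOLD:-1024}

# needs_loose_repack <repo>: succeeds if the repository has more loose
# objects than the threshold.
needs_loose_repack() {
    # The "count" field of the verbose output is the number of loose objects.
    loose=$(git -C "$1" count-objects -v | sed -n 's/^count: //p')
    [ "$loose" -gt "$LOOSE_OBJECT_THRESHOLD" ]
}
```

For example, `needs_loose_repack /path/to/repo && echo "pack loose objects"` would print the message only once the threshold is exceeded.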
- It updates the multi-pack-index to cover the newly generated
  packfile.
- It updates the multi-pack-index bitmap in case we're not connected
  to an alternate object database.
- It potentially collapses multiple packfiles into one in order to
  restore the geometric sequence.
All of this is expensive and scales with the size of the repository. Ultimately, it is a waste of resources because we have already determined that the packfiles are well-optimized. The only thing we care about here is the loose objects, so why do anything other than pack them into a new packfile?
This inefficiency is demonstrated by the following benchmark in the linux.git repository. The benchmark is prepared by writing a single new reachable object into the repository and then executing the command at hand:
    Benchmark 1: git repack --geometric=2 -d --write-midx
      Time (mean ± σ):      5.424 s ±  5.148 s    [User: 4.494 s, System: 0.870 s]
      Range (min … max):    3.079 s … 14.633 s    5 runs

    Benchmark 2: git repack --geometric=2 -d
      Time (mean ± σ):      2.445 s ±  4.784 s    [User: 2.204 s, System: 0.234 s]
      Range (min … max):    0.296 s … 11.004 s    5 runs

    Benchmark 3: git repack -d --write-midx
      Time (mean ± σ):     15.842 s ±  0.044 s    [User: 14.738 s, System: 1.044 s]
      Range (min … max):   15.777 s … 15.886 s    5 runs

    Benchmark 4: git repack -d
      Time (mean ± σ):     12.942 s ±  0.049 s    [User: 12.503 s, System: 0.430 s]
      Range (min … max):   12.887 s … 13.002 s    5 runs

    Benchmark 5: git pack-objects --pack-loose-unreachable --local --incremental --non-empty </dev/null .git/objects/pack/pack && git prune-packed
      Time (mean ± σ):     174.4 ms ±   5.2 ms    [User: 94.8 ms, System: 73.2 ms]
      Range (min … max):   170.0 ms … 183.2 ms    5 runs

    Summary
      'git pack-objects --pack-loose-unreachable --local --incremental --non-empty </dev/null .git/objects/pack/pack && git prune-packed' ran
       14.02 ± 27.43 times faster than 'git repack --geometric=2 -d'
       31.10 ± 29.53 times faster than 'git repack --geometric=2 -d --write-midx'
       74.20 ± 2.21 times faster than 'git repack -d'
       90.83 ± 2.70 times faster than 'git repack -d --write-midx'

There are several observations here:
- The first two benchmarks demonstrate how the latency of the
  geometric repack fluctuates wildly. This is because it sometimes
  has to restore the geometric sequence, while it doesn't at other
  times.
- The "incremental" repacks in benchmarks 3 and 4 are extremely
  expensive even though they supposedly only pack the new object
  into a packfile. This is because the incremental repack takes
  reachability into account so that it can skip packing objects
  which aren't referenced. This doesn't have any benefit for us
  though as we use cruft packs anyway to evict unreachable objects.
- Writing the multi-pack-index in benchmarks 1 and 3 adds another
  ~3 seconds.
The last benchmark is equivalent to the new repacking strategy we have introduced in a preceding commit: in contrast to the others, it ignores reachability and never updates multi-pack-indices. Instead, all it does is take all unpacked loose objects and write them into a new packfile. Consequently, it is roughly 14 to 91 times faster than the other repacking strategies in this benchmark.
Convert our housekeeping strategy to use this repack strategy for a nice speedup when all we want to do is optimize loose objects. Note that we don't add a feature flag for this change as this code is only hit when the geometric-repacking feature flag is enabled.
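For illustration, the command from benchmark 5 can be wrapped as follows. This is a sketch based on that benchmark and not the code added to the housekeeping package: the function name `pack_loose_objects` is hypothetical, and the `.git/objects/pack/pack` prefix assumes a non-bare repository as in the benchmark.

```shell
#!/bin/sh
# pack_loose_objects <repo>: write all loose objects that are not yet packed
# into a new packfile, then delete the now-redundant loose copies.
# Reachability is ignored and no multi-pack-index is written.
pack_loose_objects() {
    (
        cd "$1" &&
        # Empty stdin means no revisions are given, so every loose object is
        # picked up via --pack-loose-unreachable. --incremental skips objects
        # that are already packed, and --non-empty avoids writing an empty
        # packfile. The pack hash printed on stdout is discarded.
        git pack-objects --pack-loose-unreachable --local --incremental \
            --non-empty -q .git/objects/pack/pack </dev/null >/dev/null &&
        # Remove loose objects that now also exist in a packfile.
        git prune-packed
    )
}
```

Running `git count-objects -v` before and after `pack_loose_objects /path/to/repo` should show the loose object count dropping to zero while the number of packs grows by one, assuming there was at least one unpacked loose object.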