builtin/cat-file: allow filtering objects in batch mode
Hi,
at GitLab, we sometimes have the need to list all objects regardless of
their reachability. We use git-cat-file(1) with --batch-all-objects to
do this, and typically this is quite a good fit. In some cases though,
we only want to list objects of a specific type, where we then basically
have the following pipeline:
    git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
        grep '^commit ' |
        cut -d' ' -f2 |
        git cat-file --batch
This works okay in medium-sized repositories, but once you reach a certain size it isn't really an option anymore. In the Chromium repository, for example, simply listing all objects in the first invocation of git-cat-file(1) takes around 80 to 100 seconds. The workload is completely I/O-bottlenecked: my machine reads at ~500MB/s, and the packfile is 50GB in size, which matches the 100 seconds that I observe.
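As a quick sanity check of that estimate, the arithmetic can be reproduced in the shell. This is only a back-of-the-envelope sketch: it assumes decimal units (1GB = 1000MB) and a purely sequential read, and the function name `scan_seconds` is made up for illustration.

```shell
# Estimate how long a sequential read of a packfile takes.
# Usage: scan_seconds <pack size in GB> <read throughput in MB/s>
# Assumes decimal units (1 GB = 1000 MB); integer division rounds down.
scan_seconds() {
    echo $(( $1 * 1000 / $2 ))
}

# 50GB packfile read at ~500MB/s
scan_seconds 50 500
```

With the numbers quoted above this prints 100, which lines up with the upper end of the observed 80 to 100 seconds.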
This series addresses the issue by introducing object filters into git-cat-file(1). These object filters use the exact same syntax as the filters we have in git-rev-list(1), but only a subset of them is supported because not all filters can be computed by git-cat-file(1). Supported are "blob:none", "blob:limit=" as well as "object:type=".
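To make the supported subset concrete, here is a minimal sketch of a wrapper that rejects any other filter spec before calling git-cat-file(1). Both function names are hypothetical, the `--objects-filter` spelling is taken from the benchmark commands in this letter, and the `blob:limit=` check only requires that some value is present rather than validating it.

```shell
# Hypothetical helper: succeed only for filter specs that this cover
# letter lists as supported by git-cat-file(1).
is_supported_filter() {
    case "$1" in
        blob:none) return 0 ;;
        blob:limit=?*) return 0 ;;   # any non-empty value, not validated further
        object:type=blob|object:type=tree|object:type=commit|object:type=tag) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: list all objects matching a supported filter.
# Assumes a Git build that contains this series.
list_filtered() {
    if ! is_supported_filter "$1"; then
        echo "unsupported filter: $1" >&2
        return 1
    fi
    git cat-file --batch-check --batch-all-objects --objects-filter="$1"
}
```

Calling `list_filtered object:type=commit` would then stand in for the grep/cut stages of the pipeline shown earlier, while something like `list_filtered tree:0` is refused up front because that git-rev-list(1) filter is not part of the subset.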
The filters alone don't really help though: we still have to scan through the whole packfile in order to find the matching objects. While we are able to shed a bit of CPU time because we can stop emitting some of the objects, we're still I/O-bottlenecked.
The second part of the series thus teaches some of the filters to make use of bitmap indices, if available. This allows us to efficiently answer the question of where to find all objects of a specific type, and thus we can avoid scanning through the packfile and instead directly look up relevant objects, leading to a significant speedup:

    Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
      Time (mean ± σ):     82.806 s ±  6.363 s    [User: 30.956 s, System: 8.264 s]
      Range (min … max):   73.936 s … 89.690 s    10 runs

    Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
      Time (mean ± σ):      20.8 ms ±   1.3 ms    [User: 6.1 ms, System: 14.5 ms]
      Range (min … max):    18.2 ms …  23.6 ms    127 runs

    Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
      Time (mean ± σ):      1.551 s ±  0.008 s    [User: 1.401 s, System: 0.147 s]
      Range (min … max):    1.541 s …  1.566 s    10 runs

    Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
      Time (mean ± σ):     11.169 s ±  0.046 s    [User: 10.076 s, System: 1.063 s]
      Range (min … max):   11.114 s … 11.245 s    10 runs

    Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
      Time (mean ± σ):     67.342 s ±  3.368 s    [User: 20.318 s, System: 7.787 s]
      Range (min … max):   62.836 s … 73.618 s    10 runs

    Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
      Time (mean ± σ):     13.032 s ±  0.072 s    [User: 11.638 s, System: 1.368 s]
      Range (min … max):   12.960 s … 13.199 s    10 runs

    Summary
      git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
          74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
         538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
         627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
        3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
        3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter

We now directly scale with the number of objects of a specific type contained in the packfile instead of scaling with the overall number of objects. It's quite fun to see how the math plays out: if you sum up the mean times for each of the four object types, you arrive at roughly the time for the unfiltered case.
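The sum is easy to check against the reported means. The snippet below is just a sketch of that arithmetic: the helper name `sum_seconds` is made up, and the inputs are the mean times for the tag, commit, tree and blob runs copied from the benchmarks above.

```shell
# Add up all arguments (times in seconds) and print the total
# rounded to three decimals.
sum_seconds() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.3f\n", sum }'
}

# Means for object:type=tag, commit, tree and blob, in that order.
sum_seconds 0.0208 1.551 11.169 67.342
```

This prints 80.083, which is within one standard deviation of the unfiltered mean of 82.806 s ± 6.363 s.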
Thanks!
Patrick
--- b4-submit-tracking ---
This section is used internally by b4 prep for tracking purposes.
{ "series": { "revision": 1, "change-id": "20250220-pks-cat-file-object-type-filter-9140c0ed5ee1", "prefixes": [] } }
Closes #495 (Allow filtering objects in `git-cat-file(1)`).