builtin/cat-file: allow filtering objects in batch mode

Hi,

at GitLab, we sometimes need to list all objects regardless of their reachability. We use git-cat-file(1) with --batch-all-objects to do this, and typically this is quite a good fit. In some cases, though, we only want to list objects of a specific type, for which we basically use the following pipeline:

git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
grep '^commit ' |
cut -d' ' -f2 |
git cat-file --batch

This works okayish in medium-sized repositories, but once you reach a certain size this isn't really an option anymore. In the Chromium repository, for example, simply listing all objects in the first invocation of git-cat-file(1) takes around 80 to 100 seconds. The workload is completely I/O-bottlenecked: my machine reads at ~500MB/s and the packfile is 50GB in size, which works out to the 100 seconds that I observe.
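
As a quick sanity check of that estimate, the expected scan time is simply the packfile size divided by the read throughput. A minimal sketch using the rounded figures from the paragraph above and assuming decimal units (1 GB = 1000 MB); the variable names are only illustrative:

```shell
# Rounded figures from the paragraph above; decimal units assumed.
packfile_mb=$((50 * 1000))   # 50 GB packfile
read_mb_per_s=500            # ~500 MB/s sequential read

# Expected time for one full pass over the packfile, in seconds.
scan_seconds=$((packfile_mb / read_mb_per_s))
echo "$scan_seconds"
```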

This series addresses the issue by introducing object filters into git-cat-file(1). These object filters use the exact same syntax as the filters we have in git-rev-list(1), but only a subset of them is supported because not all filters can be computed by git-cat-file(1). Supported are "blob:none", "blob:limit=" as well as "object:type=".
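
With such a filter, the pipeline from above could in principle collapse into a single invocation. The following is only a sketch: it assumes a Git build that includes this series, uses the --objects-filter option name as it appears in the benchmarks of this cover letter, and wraps the call in a hypothetical helper function named list_all_commits:

```shell
# Hypothetical helper; needs a git binary built with this series applied.
# Prints every commit in the object database, reachable or not, in
# --batch format, without the grep/cut round trip.
list_all_commits() {
    git cat-file --batch --batch-all-objects \
        --objects-filter=object:type=commit
}
```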

The filters alone don't really help though: we still have to scan through the whole packfile in order to compute the result. While we are able to shed a bit of CPU time because we can stop emitting some of the objects, we're still I/O-bottlenecked.

The second part of the series thus teaches some of the filters to make use of bitmap indices, if available. This allows us to efficiently answer the question of where to find all objects of a specific type, so we can avoid scanning through the packfile and instead directly look up the relevant objects, leading to a significant speedup:

Benchmark 1: git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter
  Time (mean ± σ):     82.806 s ±  6.363 s    [User: 30.956 s, System: 8.264 s]
  Range (min … max):   73.936 s … 89.690 s    10 runs

Benchmark 2: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
  Time (mean ± σ):      20.8 ms ±   1.3 ms    [User: 6.1 ms, System: 14.5 ms]
  Range (min … max):    18.2 ms …  23.6 ms    127 runs

Benchmark 3: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
  Time (mean ± σ):      1.551 s ±  0.008 s    [User: 1.401 s, System: 0.147 s]
  Range (min … max):    1.541 s …  1.566 s    10 runs

Benchmark 4: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
  Time (mean ± σ):     11.169 s ±  0.046 s    [User: 10.076 s, System: 1.063 s]
  Range (min … max):   11.114 s … 11.245 s    10 runs

Benchmark 5: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
  Time (mean ± σ):     67.342 s ±  3.368 s    [User: 20.318 s, System: 7.787 s]
  Range (min … max):   62.836 s … 73.618 s    10 runs

Benchmark 6: git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
  Time (mean ± σ):     13.032 s ±  0.072 s    [User: 11.638 s, System: 1.368 s]
  Range (min … max):   12.960 s … 13.199 s    10 runs

Summary
  git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tag
   74.75 ± 4.61 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=commit
  538.17 ± 33.17 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=tree
  627.98 ± 38.77 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=blob:none
 3244.93 ± 257.23 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --objects-filter=object:type=blob
 3990.07 ± 392.72 times faster than git cat-file --batch-check --batch-all-objects --unordered --buffer --no-objects-filter

We now directly scale with the number of objects of a specific type contained in the packfile instead of scaling with the overall number of objects. It's quite fun to see how the math plays out: if you sum up the mean times for each of the types, you roughly arrive at the time for the unfiltered case.
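
That sum can be checked against the mean wall-clock times reported above. A small sketch that uses awk for the floating-point arithmetic; the numbers are copied from benchmarks 2 to 5, and the variable name is only illustrative:

```shell
# Mean times in seconds for tag, commit, tree and blob (benchmarks 2-5).
per_type_sum=$(awk 'BEGIN { printf "%.3f", 0.0208 + 1.551 + 11.169 + 67.342 }')
echo "$per_type_sum"
```

This comes out at roughly 80 seconds, which is within one standard deviation of the 82.806 s ± 6.363 s measured for the unfiltered run.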

Thanks!

Patrick

To: git@vger.kernel.org

--- b4-submit-tracking ---

This section is used internally by b4 prep for tracking purposes.

{ "series": { "revision": 1, "change-id": "20250220-pks-cat-file-object-type-filter-9140c0ed5ee1", "prefixes": [] } }

Closes Allow filtering objects in `git-cat-file(1)` (#495 - closed).

Edited by Patrick Steinhardt
