Convert LFS pointer checks to use `--batch-all-objects`
When doing the `/allowed` check, one part of what is done is to verify that the ref updates do not contain any new dangling LFS pointers without the corresponding object. This is done via `GetNewLFSPointers`: given a set of positive and negative refs, we compute all objects which exist in the positive but not in the negative refs.
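For context, an LFS pointer is a small text blob, which is why the check only needs to look at small blobs. The following is a minimal sketch of recognizing one, assuming the pointer layout of the Git LFS v1 specification (a `version` line first, plus `oid sha256:` and `size` lines). The function name `is_lfs_pointer` is hypothetical and this is not how Gitaly implements the check.

```shell
# Hypothetical helper: reads a blob's content on stdin and succeeds if it
# looks like a Git LFS v1 pointer. Assumes the v1 pointer layout; the length
# of the hash is deliberately not validated in this sketch.
is_lfs_pointer() {
    awk '
        # The version line must be the first line of the pointer.
        NR == 1 { version = ($0 == "version https://git-lfs.github.com/spec/v1") }
        /^oid sha256:[0-9a-f]+$/ { oid = 1 }
        /^size [0-9]+$/ { size = 1 }
        # Exit status 0 only if all three lines were seen.
        END { exit !(version && oid && size) }
    '
}
```

A blob's content could be piped into it, for example `git cat-file blob "$oid" | is_lfs_pointer`, where `$oid` is a placeholder for an object ID.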
The problem with the current approach is that in order to compute the diff between both sets, we need to do a complete graph walk. So in essence, the check scales with repository size. E.g. for gitlab-org/gitlab, we see that the LFS pointer check typically takes about 7 seconds. Other repos which are bigger frequently exceed the allowed timeout of 30 seconds. It's thus safe to say that this simply does not scale.
Instead of scaling with repository size, what we ideally want is to scale with the push size. And that's trivial to do: when receiving a push, all objects are first written into a quarantine object directory. Instead of traversing references to determine which objects got pushed just now, we can directly determine the set of pushed objects by retrieving all objects in the quarantine object directory via `env --unset=GIT_ALTERNATE_OBJECT_DIRECTORIES git cat-file --batch-all-objects` (assuming `GIT_OBJECT_DIRECTORY` points to the quarantined objects). This does not require any refwalk and thus scales with push size, not repo size.
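The enumeration described above could be sketched as a pair of shell functions. This is a sketch under stated assumptions, not the actual implementation: both function names are hypothetical, the 200-byte threshold is an assumption about how small an LFS pointer blob is, and the second function only makes sense in a hook context where `GIT_OBJECT_DIRECTORY` points at the quarantine directory.

```shell
# Hypothetical helper: reads "<type> <size> <oid>" lines on stdin and prints
# the object IDs of blobs whose size does not exceed the limit (default: 200
# bytes, an assumed upper bound for LFS pointer blobs).
filter_small_blobs() {
    awk -v limit="${1:-200}" '$1 == "blob" && $2 <= limit { print $3 }'
}

# Hypothetical helper: list small blobs among the quarantined objects only.
# Unsetting GIT_ALTERNATE_OBJECT_DIRECTORIES hides the main object database,
# so --batch-all-objects sees just what was pushed. No ref walk is involved.
list_quarantined_small_blobs() {
    env --unset=GIT_ALTERNATE_OBJECT_DIRECTORIES \
        git cat-file --buffer --batch-all-objects \
            --batch-check='%(objecttype) %(objectsize) %(objectname)' |
        filter_small_blobs 200
}
```

The resulting object IDs can then be fed into `git cat-file --batch --buffer` to read the blob contents. Keeping the filter separate from the git invocation means it can be exercised with synthetic input, without a repository.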
I've done a quick benchmark to verify that this indeed works. The benchmarking setup uses a mirror-clone of gitlab-org/gitlab with the following pre-receive hook. The first command with git-rev-list(1) is what we're running in production right now, while the second command is the alternative proposed implementation.
#!/bin/bash
zerooid=0000000000000000000000000000000000000000
newoids=()
while read oldoid newoid refname
do
    if test "$newoid" = "$zerooid"
    then
        continue
    fi
    newoids+=("$newoid")
done
hyperfine --warmup=3 \
--command-name 'LFS pointers via rev-list' \
'git rev-list --objects --filter=blob:limit=200 --no-object-names --in-commit-order "${newoids[@]}" --not --all | git cat-file --batch --buffer' \
--command-name 'LFS pointers via --batch--all-objects' \
'env --unset=GIT_ALTERNATE_OBJECT_DIRECTORIES git cat-file --buffer --batch-check="%(objecttype) %(objectsize) %(objectname)" --batch-all-objects | awk "{ if ($1 == \"blob\" && $2 <= 200) print $3 }" | git cat-file --batch --buffer'
exit 1
This leads to the following numbers:
$ git push origin master # 1000 commits with one change each
Enumerating objects: 3004, done.
Counting objects: 100% (3004/3004), done.
Delta compression using up to 8 threads
Compressing objects: 100% (2002/2002), done.
Writing objects: 100% (3003/3003), 242.37 KiB | 2.66 MiB/s, done.
Total 3003 (delta 987), reused 3 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (987/987), completed with 1 local object.
remote: Benchmark #1: LFS pointers via rev-list
remote: Time (mean ± σ): 554.3 ms ± 20.6 ms [User: 527.5 ms, System: 27.0 ms]
remote: Range (min … max): 521.9 ms … 590.5 ms 10 runs
remote:
remote: Benchmark #2: LFS pointers via --batch--all-objects
remote: Time (mean ± σ): 3.8 ms ± 1.6 ms [User: 5.8 ms, System: 2.5 ms]
remote: Range (min … max): 2.4 ms … 23.0 ms 555 runs
remote:
remote: Warning: Command took less than 5 ms to complete. Results might be inaccurate.
remote: Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
remote:
remote: Summary
remote: 'LFS pointers via --batch--all-objects' ran
remote: 145.14 ± 59.30 times faster than 'LFS pointers via rev-list'
$ git push origin $(seq -f 'branch-%g' 100) # push 100 branches, where each has the same 1000 commits plus one that is different per branch
Enumerating objects: 3304, done.
Counting objects: 100% (3304/3304), done.
Delta compression using up to 8 threads
Compressing objects: 100% (2202/2202), done.
Writing objects: 100% (3303/3303), 252.86 KiB | 1.78 MiB/s, done.
Total 3303 (delta 1187), reused 3 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1187/1187), completed with 1 local object.
remote: Benchmark #1: LFS pointers via rev-list
remote: Time (mean ± σ): 561.1 ms ± 23.1 ms [User: 535.9 ms, System: 25.4 ms]
remote: Range (min … max): 539.8 ms … 604.2 ms 10 runs
remote:
remote: Benchmark #2: LFS pointers via --batch--all-objects
remote: Time (mean ± σ): 4.8 ms ± 4.4 ms [User: 5.5 ms, System: 3.2 ms]
remote: Range (min … max): 0.6 ms … 48.7 ms 600 runs
remote:
remote: Warning: Command took less than 5 ms to complete. Results might be inaccurate.
remote: Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
remote:
remote: Summary
remote: 'LFS pointers via --batch--all-objects' ran
remote: 117.15 ± 107.70 times faster than 'LFS pointers via rev-list'
$ git push origin gitaly/master:refs/heads/gitaly # push of unrelated history to emulate lots of objects (pushing Gitaly into the GitLab repo)
Enumerating objects: 60730, done.
Counting objects: 100% (60730/60730), done.
Delta compression using up to 8 threads
Compressing objects: 100% (15407/15407), done.
Writing objects: 100% (60730/60730), 24.02 MiB | 36.27 MiB/s, done.
Total 60730 (delta 41912), reused 60715 (delta 41901), pack-reused 0
remote: Resolving deltas: 100% (41912/41912), done.
remote: Checking connectivity: 60730, done.
remote: Benchmark #1: LFS pointers via rev-list
remote: Time (mean ± σ): 597.5 ms ± 45.9 ms [User: 564.4 ms, System: 31.2 ms]
remote: Range (min … max): 516.0 ms … 651.6 ms 10 runs
remote:
remote: Benchmark #2: LFS pointers via --batch--all-objects
remote: Time (mean ± σ): 11.4 ms ± 5.8 ms [User: 11.1 ms, System: 4.3 ms]
remote: Range (min … max): 4.2 ms … 36.3 ms 315 runs
remote:
remote: Warning: Command took less than 5 ms to complete. Results might be inaccurate.
remote: Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
remote:
remote: Summary
remote: 'LFS pointers via --batch--all-objects' ran
remote: 52.22 ± 26.74 times faster than 'LFS pointers via rev-list'
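These numbers demonstrate both that the check is a lot faster (roughly 50 to 145 times in these runs) and that it scales with the number of objects pushed: the rev-list variant takes about 550 to 600 ms regardless of push size, while the `--batch-all-objects` variant grows from about 4 ms for roughly 3,000 pushed objects to about 11 ms for roughly 60,000. Note that hyperfine flags the sub-5 ms measurements as potentially inaccurate, so the exact speedup factors should be read as rough estimates.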