
blob: Speed up LFS pointer search via object type filters

Patrick Steinhardt requested to merge pks-lfs-pointers-object-type-filter into master

The ListLFSPointers() RPC returns all LFS pointers referenced by a set of revisions. This filtering is quite expensive: we first need to enumerate all reachable objects, then check for each object whether it is a blob and whether its size indicates that it could be an LFS pointer, and finally read each candidate blob's contents to test whether it really is an LFS pointer.

To optimize this a bit, we already set a blob size limit of 200 bytes, which is the maximum size an LFS pointer can have. While this greatly reduces the number of candidate blobs, git-rev-list(1) will still unconditionally list all other object types. Effectively, we're thus needlessly retrieving object info for all tags, commits and trees only to notice that they aren't blobs in the first place, which is a huge waste of time.

To tackle this problem, we have upstreamed two new options for git-rev-list(1):

- By default, git-rev-list(1) will unconditionally print objects
  which have been passed to it directly, either via the command
  line or via stdin. A new option `--filter-provided-objects` has
  been added which changes this behaviour and causes provided
  revisions to be filtered as well.

- A new object type filter `--filter=object:type=<type>` has been
  added which will cause git-rev-list(1) to only list objects whose
  type matches the given type.

Used in combination, this brings down the number of potential LFS pointer candidates by a significant factor. Executed on linux.git:

$ git rev-list --objects --filter=blob:limit=200 --all | wc -l
7146677

$ git rev-list --objects --filter=blob:limit=200 --all \
    --filter=object:type=blob --filter-provided-objects | wc -l
15217

For this particular repo, that leaves roughly 470 times fewer objects to check for being an LFS pointer. Naturally, this is an artificial demonstration, because we don't typically search for LFS pointers with --all. But we can expect this to translate into speedups at a smaller scale too, since we no longer do that pointless work.

So let's use this by setting the new withObjectTypeFilter() option when we're running a Git version which supports it. No new feature flag is introduced, given that we only implement it in the new pipeline code, which is already guarded by a feature flag anyway.
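A version gate of this kind might look roughly like the sketch below. The helpers `parseGitVersion` and `versionAtLeast` are hypothetical, and the `2.32.0` threshold used in `main` is an assumption made for the example only; the description does not state which Git version is required. The parser only handles plain `major.minor.patch` strings as printed by `git version`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGitVersion extracts up to three numeric components from strings
// such as "git version 2.31.1". Hypothetical helper for this sketch.
func parseGitVersion(s string) ([3]int, error) {
	var v [3]int
	s = strings.TrimPrefix(strings.TrimSpace(s), "git version ")
	parts := strings.SplitN(s, ".", 4)
	if len(parts) < 2 {
		return v, fmt.Errorf("cannot parse version %q", s)
	}
	for i := 0; i < 3 && i < len(parts); i++ {
		n, err := strconv.Atoi(parts[i])
		if err != nil {
			return v, fmt.Errorf("invalid version component %q", parts[i])
		}
		v[i] = n
	}
	return v, nil
}

// versionAtLeast reports whether v is greater than or equal to min.
func versionAtLeast(v, min [3]int) bool {
	for i := 0; i < 3; i++ {
		if v[i] != min[i] {
			return v[i] > min[i]
		}
	}
	return true
}

func main() {
	// Assumed threshold, for illustration only.
	required := [3]int{2, 32, 0}

	for _, raw := range []string{"git version 2.31.1", "git version 2.33.0"} {
		v, err := parseGitVersion(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: use object type filter = %v\n", raw, versionAtLeast(v, required))
	}
}
```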

Part of #3618 (closed)
