Skip to content

Consider using the `list_commits` and `find_changed_paths` methods to retrieve blob metadata

Context

At the time groupsource code were working on delivering GitGuardian-based pre-receive secret detection, they refactored the way blobs are retrieved from Gitaly to make the use of ListAllBlobs() and ListBlobs() RPCs possible in other similar access checks.

They had also updated the new ChangedBlobs service they had introduced as follows:

  1. Updated the call to ListBlobs() to pass with_paths flag.
  2. Updated the call to ListAllBlobs() to use an extra RPC (i.e. FindChangedPaths) to get changed paths.

In order to be able to retrieve file paths of blobs a quarantine directory exist or not.

For more information on this, please see these two merge requests: 1, 2.

Overview

A proposal was made by @igor.drozdov in this POC merge request to replace:

With both:

To load a blob metadata (i.e. file path and commit sha) without having to walk the git tree of all changed commits.

However, such change would likely involve an extra call to Gitaly as can be seen below.

Current Implementation

  • ListAllBlobs() or ListBlobs() – 1 per scan
  • GetTreeEntries() – 1 per commit

Igor's POC

  • ListAllBlobs() or ListBlobs() – 1 per scan
  • ListAllCommits() – 1 per scan
  • FindChangedPaths() – 1 per commit

Assistance Needed

This issue might require some assistance from either groupgitaly or groupsource code.

Proposal

We should spend a bit of time to explore the proposal from the aforementioned merge request, with the aim being:

  • Confirm the potential performance gain by profiling the change using our automated profiling tests.
  • Confirm the technical approach matches our plans and works for our use case.

And if we are happy with the results, we should consider updating the implementation as proposed.

Edited by Ahmed Hemdan