SPP - Switch to using raw_info in DiffBlobs RPC for performance improvements
RFH: https://gitlab.com/gitlab-com/request-for-help/-/issues/3043 (confidential)
Some customers are experiencing timeouts when using Secret Push Protection (SPP) on very large changes.
This appears to be caused by a large number of DiffBlobs requests using the blob_pairs field.
Proposal
We identified two performance enhancements from the RFH:
1️⃣ Use raw_info in DiffBlobs() RPC
Recent improvements introduced in gitaly!7794 (merged) make use of the new git-diff-pairs(1) functionality in Git v2.50, enabling significantly more efficient batch diff generation. Based on internal testing, switching to raw_info can yield a 10-20x speedup when diffing large sets of blobs. We should switch from using blob_pairs to the new raw_info field, which batches diff generation using the output of the FindChangedPaths RPC.
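As a rough illustration of the intended shape of the change, the sketch below forwards the entries returned by FindChangedPaths as a single batch rather than extracting blob pairs from them. The `ChangedPath` and `DiffBlobsRequest` classes, their field names, and `build_batched_request` are hypothetical stand-ins for illustration only; they are not Gitaly's generated client classes, and the real message layout should be checked against the Gitaly protobuf definitions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ChangedPath:
    """Hypothetical stand-in for one FindChangedPaths entry (field names assumed)."""
    path: str
    status: str
    old_blob_id: str
    new_blob_id: str


@dataclass(frozen=True)
class DiffBlobsRequest:
    """Hypothetical stand-in for a DiffBlobs request that carries raw_info."""
    raw_info: List[ChangedPath]


def build_batched_request(changed_paths: Iterable[ChangedPath]) -> DiffBlobsRequest:
    # Pass the FindChangedPaths entries through unchanged so the server side
    # can generate all diffs as one batch.
    return DiffBlobsRequest(raw_info=list(changed_paths))
```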
2️⃣ Filter out paths with a DELETED status from the FindChangedPaths() response
Responses from the FindChangedPaths() RPC include the paths of all changed files, regardless of their status. However, Secret Push Protection does not need to scan deleted files, since they do not introduce new code.
Therefore, to improve the performance of the entire operation, we could filter out files with a DELETED status. gitaly#6831 (closed) aims to introduce this as a configuration option for FindChangedPaths(), but until it's implemented, we could do the filtering on our side.
In that case, we only request diffs (using DiffBlobs()) from Gitaly for the files that have modifications we care about:
ADDED, MODIFIED, TYPE_CHANGE, COPIED, RENAMED
See this discussion for more details.
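The client-side filtering described above could look roughly like the following. This is a minimal sketch, not the SPP implementation: `ChangedPath`, `SCANNABLE_STATUSES`, and `filter_scannable` are hypothetical names, and statuses are modelled as plain strings rather than Gitaly's enum values.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Statuses worth scanning; DELETED is intentionally absent.
SCANNABLE_STATUSES = frozenset(
    {"ADDED", "MODIFIED", "TYPE_CHANGE", "COPIED", "RENAMED"}
)


@dataclass(frozen=True)
class ChangedPath:
    """Hypothetical stand-in for one FindChangedPaths entry."""
    path: str
    status: str


def filter_scannable(changed_paths: Iterable[ChangedPath]) -> List[ChangedPath]:
    # Keep only entries whose status can introduce new content,
    # preserving the original order of the response.
    return [cp for cp in changed_paths if cp.status in SCANNABLE_STATUSES]
```

Using an allow-list rather than excluding DELETED alone means any status not listed here is also skipped; whether that is the desired behaviour for statuses outside the five above is a design decision to confirm.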
Enhance testing
The SPP tests do not currently cover every kind of change a git push can contain. Update the tests so that each of the following git modifications is tested:
ADDED, MODIFIED, TYPE_CHANGE, COPIED, RENAMED