Secret push protection errors and timeouts (with GetTreeEntries)

Problem

As we ramped up adoption for the Secret Push Protection feature with initiatives such as enabling for all Gitlab-owned projects and for all public projects, we are increasingly receiving reports from within the team about their pushes timing out or throwing 500 errors. We have identified them to be mostly related to Gitaly's GetTreeEntries() RPC.

Below is a list of these issues and some explanation for their intricacies as well:

1️⃣ Push failing due to removed files (or commits)

When trying to push code with a branch that is based on an old master (100 commits behind), some of the files and/or the commits were no longer present in the associated git tree, causing the GetTreeEntries() RPC to throw a GRPC::NotFound: 5:invalid revision or path. error, which eventually raised a Gitlab::Git::Index::IndexError exception on the monolith side.

We did not handle this use case well since we don't use rescue_not_found: true.

This was uncovered by @vyaklushin who implemented a quick fix, which we got merged earlier this week. 🎉

2️⃣ Push timing out during repository mirroring or branch syncing

We had long identified repository mirroring or branch syncing (and similar automated workflows) to be an invalid use case for Secret Push Protection. However, up until recently, the merge-train tool used to sync between the canonical repository and the security and FOSS mirrors was executing git push commands directly, without skipping secret push protection scans, causing some of the associated merge train pipelines to fail.

To resolve this, we tried to skip the scan using -o secret_push_protection.skip_all push option (which didn't work, see next issue) and we eventually had to disable the feature for this particular project. Other projects remain unaffected.

As we remain of the belief that repository mirroring and similar automated workflows should have Secret Push Protection skipped by default, we had implemented a way to identify and stop invoking SPP for several Gitaly RPCs that normally trigger git access checks from automated workflows, but using git push in pipelines directly can still cause errors.

⚠️ We should encourage team members to never do so without skipping Secret Push Protection scans. ⚠️

3️⃣ Push cannot be skipped with -o secret_push_protection.skip_all push option

We had written a detailed report of this in #571508. Check the issue for more details.

4️⃣ Push timing out due to volume of commits

This had been a symptom in the other two reports as well.

In all three reports, we were making as many requests to the GetTreeEntries() RPC as there are commits in the push. This is inefficient as each request takes time to complete, and even if it only takes as low as 1 millisecond, this adds up for all commits, causing the scan and the entire push to eventually timeout as git access checks that is run for all push operations has a timeout of 50 seconds.

But why do we use the GetTreeEntries() RPC to begin with?

Let me add some historical context here: when SPP was first built, the DiffBlobs() RPC did not exist, so we had to scan full files instead of diffs. As we migrated to diff scanning we had GetTreeEntries() left behind despite ListAllCommits() and FindChangedPaths() both returning the file path and commit sha that we had to retrieve in order to display the scan results to the user.

We agreed to do this in a follow-up, but could not complete the refactor due to a lot of other competing priorities. The only concern we have about this plan, is the need to build out a finding-to-commit map with only ListAllCommits() and FindChangedPaths().

Why not only send those requests for commits with found secrets?

Due to several technical limitations of the Gitaly RPCs we use, there's no straightforward way to only limit those requests to the commits were secrets are detected. This is because we don't know which commits has secrets until we actually call the GetTreeEntries() RPC for each commit:

  • We start by fetching a list of new commits to scan using the ListAllCommits() RPC.
  • We pass those commits to FindChangedPaths() RPC to get all files that had changed in these commits.
  • We pass the changed paths to DiffBlobs() RPC to retrieve the actual code patches to scan.

When passing the data to the scanner, we have a blob/file ID, but there's no association between those files and the commits they come from, so we use GetTreeEntries() to find out the file path and the commit sha for each finding. In order to cross-reference them, we have to iterate through each commit and build out a finding-to-commit map that we then use to present to the user.

Potential Optimizations

We had explored a few options over the past few weeks to resolve some of the issues reported.

  • Rescue and log generic scan exceptions, but do not block the push (i.e. fail-open). It's necessary to inform users about the errors/exceptions when the push goes through though, so as not to give a false sense of security.

  • Remove calls to GetTreeEntries() RPC and refactor the scanning process. We still have to figure out how to build out the finding-to-commit map with only ListAllCommits() and FindChangedPaths(), so this could prove to be a larger effort than anticipated.

  • Address issues with DiffBlobs() newly introduced raw_info option. This new option promises performance gains, but when it was enabled behind a feature flag (now disabled) we observed several errors. Upon investigating, turns out we have to make some changes to the response received from FindChangedPaths() before passing it to DiffBlobs().

  • Skip git submodules from scanning. This one is faster and more straigtforward than the rest of the potential optimizations, we just need to skip git submodule paths when retrieved from FindChangedPaths() from scanning.

Edited by 🤖 GitLab Bot 🤖