Perform secret detection for highest risk content pre-receive
This issue has been promoted to an epic: see GitLab-native pre-receive secret detection scan... (&11439).
Problem
Committing secrets to a git repository is an ongoing problem, and GitLab has a solution in the form of a secret detection CI job. However, this only runs after the code has been added to the repository, potentially exposing the secret and requiring costly cleanup. You can read more about this challenge in this blog post: https://about.gitlab.com/blog/2023/01/04/pat-revocation-coming-soon/
Instead of cleaning up after the fact, it would be better to enforce secret detection before changes are added to the repository. This is possible today on self-managed via server hooks; however, that approach has multiple drawbacks:
- It is not a first-class GitLab feature available out of the box
- Custom server-side git hooks are not supported on GitLab.com or Dedicated due to security risks
- Custom server-side git hooks can be a source of instability and load on self-managed servers as arbitrary code runs on each commit
- We are trying to move away from custom server-side git hooks
For these reasons, we should come up with a better solution.
Intention and scope
The intention of this solution is to prevent the most common cause of secrets being introduced into repositories: accidentally including them in a commit.
Purposeful leaking of secrets into the repository, or less common cases like embedding secrets in binary file types, are out of scope.
Proposed solution - Server-side secret scanning during git push operations
Gitaly already contacts the /internal/allowed endpoint prior to accepting changes from a push command. This would be the appropriate place to insert a secret scanning mechanism. Additionally, Gitaly already has the means to enumerate only the new blobs. This has the added benefit that any such scanning would scale with the size of the change and not with the size of the repository.
Pros:
- Checked at the server side, no client configuration / binaries required
- Automatically applies to all ways of committing to repositories (clients, Web UI, remote development, etc.)
Cons:
- While scanning only new blobs should scale well in most cases, we would most likely need a way to exclude specific non-performant file types, such as binary files or very large blobs
- Must not degrade the experience of other customers/users who may have repositories on the same node
- Must not become a source of alerts and other instability internally
- The scanner needs to be trusted code as it will be running on our SaaS services, and will need to meet our code quality and security requirements
Details
New built-in push rule
We would introduce a new built-in push rule to GitLab, similar to the existing rule which performs checks based on filename: https://docs.gitlab.com/ee/user/project/repository/push_rules.html#prevent-pushing-secrets-to-the-repository
Usage of Gitaly internal/allowed API to only scan changes in commit
This push rule would trigger a new push check using the /internal/allowed API: https://docs.gitlab.com/ee/development/internal_api/internal_api_allowed.html#push-checks
Details of the `/internal/allowed` functionality:
First, Gitaly sends the list of new reference tips to the /internal/allowed endpoint, which tells it which references have been updated to which commits. From here on, there are essentially two different ways to enumerate the new objects.
git-rev-list(1)-based enumeration
The caller essentially executes git rev-list $new_tips --not --all, which asks Git to enumerate all objects reachable from the new tips while excluding all preexisting objects. This is the obvious solution, but it is unfortunately very slow in large monorepositories. In gitlab-org/gitlab, this enumeration would frequently take around 25-30 seconds. This would be implemented e.g. via ListCommits() or ListBlobs().
quarantine-directory-based enumeration
Nowadays, all changes in Gitaly are first staged into a quarantine directory that hosts the new objects outside of the main object database. Gitaly sends information about that staging directory to the client. This has two advantages:
- The client can actually tell Gitaly which objects to enumerate, which is a prerequisite for inspecting the objects in the first place.
- It is possible to simply enumerate all objects in that quarantine directory, which avoids the graph walk of the first option.
This is the recommended method nowadays: the client invokes ListAllBlobs() or ListAllCommits() to enumerate all new blobs or commits, respectively. This is far more efficient and scales with the size of the change rather than with the number of references. In gitlab-org/gitlab, this has reduced the time to enumerate new objects from 25-30 seconds to about a dozen milliseconds. There are two downsides:
- The blobs have no path attached to them. This is not an issue in the context of this issue, though, as we want to do content-based scanning anyway.
- When serving pushes, the client may end up sending more objects than required, which means some objects might end up being validated multiple times. I don't think this is an issue either.
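As a rough sketch of how a caller could address the quarantined objects when spawning git commands, git's own quarantine mechanism points GIT_OBJECT_DIRECTORY at the staging directory and keeps the main object database reachable as an alternate. The helper below is illustrative only; Gitaly communicates the equivalent relative paths over its API rather than raw env vars.

```go
package main

import "fmt"

// quarantineEnv builds the environment needed to read objects that are
// still staged in a push's quarantine directory. git-receive-pack(1)
// quarantines incoming objects by pointing GIT_OBJECT_DIRECTORY at a
// temporary directory, with the main object database available as an
// alternate. Helper name and layout are illustrative.
func quarantineEnv(quarantineDir, mainObjectDir string) []string {
	return []string{
		"GIT_OBJECT_DIRECTORY=" + quarantineDir,
		"GIT_ALTERNATE_OBJECT_DIRECTORIES=" + mainObjectDir,
	}
}

func main() {
	for _, e := range quarantineEnv("/repo/objects/incoming-abc123", "/repo/objects") {
		fmt.Println(e)
	}
}
```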
Rails side
The Rails side nowadays uses the second approach when available. It still has the first approach in place in case there are calls where there is no quarantine directory available. Traditionally we only did this for listing commits, but in the context of global file size limits we have recently also extended this model to work with blobs.
Secret scanner - likely based on #413274
GitLab uses GitLeaks for CI-based secret scanning, but that may not be the best solution here. Since Gitaly can provide the changeset directly, we do not need any git integration. In addition, due to the strict performance requirements, we only want scans that run in a deterministic way; scans which require external validation or other longer checks will be problematic to run inline with the commit transaction. We have been exploring other approaches here: #413274
Scanner requirements:
- Able to scan only the changes in the commit, which are received via gitaly API
- Ability to only scan files under a configurable size
- Ability to only scan commits with changes under a certain size / # of files
- Ability to ignore problematic file types like binary content
- Performance given scan constraints is sufficiently fast to not pose stability or UX problems
- Able to be compiled for the officially supported GitLab distributions/architectures
- Able to meet and pass our security standards
Handling false positives
We will need a way to handle false positives and still allow that content to be pushed to GitLab. While we can strive to reduce the frequency, it will happen on occasion and we need an "escape hatch" to allow work to continue.
Proposal:
- Commit message flag to indicate secret scanning should be skipped/ignored - CI example: [ci skip] (https://docs.gitlab.com/ee/ci/pipelines/#skip-a-pipeline)
- Push option to skip secret scanning - CI example: the ci.skip push option
Note that the post-push secret scanning will still run, even if the pre-receive secret scanning is skipped.
Git push response on a detected secret
We should return an error message to the user, including the file, line, and position where the secret was found.
We should also state that, if the detection is a false positive, the user can use the methods above to skip secret scanning.
We support both of these options today for CI.
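A sketch of what such a rejection message could look like from the pre-receive check; the function, wording, and "remote:" framing are illustrative placeholders, not final UX.

```go
package main

import "fmt"

// rejectionMessage formats the pre-receive error returned to the pusher
// when a secret is found, including file, line, and column, plus a
// pointer to the skip mechanisms and a note that post-push scanning
// still runs. Exact wording is illustrative.
func rejectionMessage(file string, line, col int, rule string) string {
	return fmt.Sprintf(
		"remote: PUSH BLOCKED: possible secret (%s) detected in %s:%d:%d.\n"+
			"remote: If this is a false positive, re-push using the skip push option\n"+
			"remote: or the skip commit-message flag; post-push secret scanning will still run.",
		rule, file, line, col)
}

func main() {
	fmt.Println(rejectionMessage("config/secrets.yml", 12, 9, "gitlab_personal_access_token"))
}
```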
Feature control
This feature should be a setting which can be controlled at the group and project level.
Layered approach
Due to performance concerns of scanning arbitrary large files/commits during push operations, we will need to limit the pre-receive scanner to certain maximum file sizes and file types.
While most secrets tend to be in smaller text-based files, there is the potential that some may be present in larger or binary files.
To solve this, we would still recommend running the existing CI-based job or the upcoming continuous secret scanning feature: &8667
Rationale
The primary reasons for selecting this approach are:
- Preventing secrets from being pushed is preferable to cleaning up afterwards
- Purely client-side scanning solutions are hard to enforce and manage
- Git changes are atomic, so if we are to reject a change, it must happen prior to accepting the change
Estimated effort and timeline
TBD
Other discussed solutions
There are a few other solutions that have been discussed:
- Requiring remote-only development. This way the enterprise can control the environment and require scanning.
- Bundle and require client-side pre-commit hook scanning. Perhaps bundle with the glab CLI (both glab and gitleaks are golang), and find some way to require it.
Remote only development
Provide a way for enterprises to allow only remote development as a way to commit changes.
Pros:
- Allows full control over the dev environment
Cons:
- Remote development is early at GitLab, with low maturity especially around troubleshooting and other debugging scenarios
- Requires complete buy-in from the organization and its developers. Some may not want to leave their existing ways of working client-side.
- Needs some way of blocking other commits from other sources
Require client-side scanning to commit
Bundle secret scanner into CLI and require scanned commits only
Pros:
- Client side solution without server side performance impact
Cons:
- Requires client-side trust, which is unlikely in enterprises unless we can find some way of strongly certifying that the commit was scanned
- Additional dependency to install for end users on their machines
- We would also need to embed this in all server-side editing options like the single file editor, multi-file editor, and so on
Considerations
- For remote-only development, how many companies would this serve? Is remote development ready for this to be the only mechanism of development?
- For client-side pre-commit hooks, how can an enterprise enforce this to occur when users may have admin control of the device?
- Potentially https://github.com/finos/git-proxy
- For official server-side hooks, is it possible to scan with a performance and load that is light enough to be workable at scale?