Perform secret detection for highest risk content pre-receive
This issue has been promoted to an epic: see GitLab-native pre-receive secret detection scan... (&11439).
Problem
Committing secrets to a git repository is an ongoing problem, and GitLab has a solution in the form of a secret detection CI job. However, this only runs after the code has been added to the repository, potentially exposing the secret and requiring costly cleanup. You can read more about this challenge in this blog post: https://about.gitlab.com/blog/2023/01/04/pat-revocation-coming-soon/
Instead of cleaning up after the fact, it would be better to enforce secret detection before changes are added to the repository. This is possible today on self-managed via server hooks; however, that approach has multiple drawbacks:
- It is not a first-class GitLab feature available out of the box
- Custom server-side git hooks are not supported on GitLab.com or Dedicated due to security risks
- Custom server-side git hooks can be a source of instability and load on self-managed servers as arbitrary code runs on each commit
- We are trying to move away from custom server-side git hooks
For these reasons, we should come up with a better solution.
Intention and scope
The intention of this solution is to prevent the most common cause of secrets being introduced into repositories: accidentally including them in a commit.
Purposeful leaking of secrets into the repository, or less common cases like embedding secrets in binary file types, are out of scope.
Proposed solution - Server-side secret scanning during git push operations
Gitaly already contacts the /internal/allowed endpoint prior to accepting changes from a push command. This would be the appropriate place to insert a secret scanning mechanism. Additionally, Gitaly already has the means to enumerate only the new blobs. This has the added benefit that any such scanning would scale with the size of the change and not with the size of the repository.
Pros:
- Checked at the server side, no client configuration / binaries required
- Automatically applies to all ways of committing to repositories (clients, Web UI, remote development, etc.)
Cons:
- While scanning only new blobs should scale well in most cases, we would most likely need a way to exclude specific non-performant file types, such as binary files or very large blobs
- Must not degrade the experience of other customers/users who may have repositories on the same node
- Must not become a source of alerts and other instability internally
- The scanner needs to be trusted code as it will be running on our SaaS services, and will need to meet our code quality and security requirements
Details
New built-in push rule
We would introduce a new built-in push rule to GitLab, similar to the existing rule which performs checks based on filename: https://docs.gitlab.com/ee/user/project/repository/push_rules.html#prevent-pushing-secrets-to-the-repository
Usage of Gitaly internal/allowed API to only scan changes in commit
This push rule would trigger a new push check using the /internal/allowed API: https://docs.gitlab.com/ee/development/internal_api/internal_api_allowed.html#push-checks
Details of the `/internal/allowed` functionality:
First, Gitaly sends the list of new reference tips to the /internal/allowed endpoint, which tells it which references have been updated to which commits. From here on, there are essentially two different ways to enumerate the new objects.
git-rev-list(1)-based enumeration
The caller essentially executes git rev-list $new_tips --not --all, which asks Git to enumerate all objects reachable from the new tips while excluding all preexisting objects. This is the obvious solution, but it is unfortunately very slow in large monorepositories. In gitlab-org/gitlab, this enumeration would frequently take around 25-30 seconds. This would be implemented e.g. via ListCommits() or ListBlobs().
quarantine-directory-based enumeration
Nowadays, all changes in Gitaly are first staged into a quarantine directory that hosts the new objects outside of the main object database. Gitaly sends information about that staging directory to the client. This has two advantages:
- The client can actually tell Gitaly which objects to enumerate, which is a prerequisite for inspecting the objects in the first place.
- It is possible to simply enumerate all objects in that quarantine directory, which avoids the graph walk of the first option.
This is the recommended method nowadays: the client invokes ListAllBlobs() or ListAllCommits() to enumerate all new blobs or commits, respectively. This is far more efficient and scales with the size of the change rather than with the number of references. In gitlab-org/gitlab, this has reduced the time to enumerate new objects from 25-30 seconds to about a dozen milliseconds. There are two downsides:
- The blobs have no path attached to them. This is not an issue in the context of this issue, though, as we want to do content-based scanning anyway.
- When serving pushes, the client may end up sending more objects than required, which means some objects might end up being validated multiple times. I don't think this is an issue either.
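As a rough sketch of how a caller could address the quarantined objects when spawning git commands, git's own quarantine mechanism points GIT_OBJECT_DIRECTORY at the staging directory and keeps the main object database reachable as an alternate. The helper below is illustrative only; Gitaly communicates the equivalent relative paths over its API rather than raw env vars.

```go
package main

import "fmt"

// quarantineEnv builds the environment needed to read objects that are
// still staged in a push's quarantine directory. git-receive-pack(1)
// quarantines incoming objects by pointing GIT_OBJECT_DIRECTORY at a
// temporary directory, with the main object database available as an
// alternate. Helper name and layout are illustrative.
func quarantineEnv(quarantineDir, mainObjectDir string) []string {
	return []string{
		"GIT_OBJECT_DIRECTORY=" + quarantineDir,
		"GIT_ALTERNATE_OBJECT_DIRECTORIES=" + mainObjectDir,
	}
}

func main() {
	for _, e := range quarantineEnv("/repo/objects/incoming-abc123", "/repo/objects") {
		fmt.Println(e)
	}
}
```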
Rails side
The Rails side nowadays uses the second approach when available. It still has the first approach in place in case there are calls where there is no quarantine directory available. Traditionally we only did this for listing commits, but in the context of global file size limits we have recently also extended this model to work with blobs.
Secret scanner - likely based on #413274
GitLab uses GitLeaks for CI-based secret scanning, but that may not be the best solution here. Since Gitaly can provide the changeset directly, we do not need any git integration. In addition, due to the strict performance requirements, we only want scans that run in a deterministic way; scans which require external validation or other longer checks will be problematic to run inline with the commit transaction. We have been exploring other approaches here: #413274
Scanner requirements:
- Able to scan only the changes in the commit, which are received via gitaly API
- Ability to only scan files under a configurable size
- Ability to only scan commits with changes under a certain size / # of files
- Ability to ignore problematic file types like binary content
- Performance given scan constraints is sufficiently fast to not pose stability or UX problems
- Able to be compiled for the officially supported GitLab distributions/architectures
- Able to meet and pass our security standards
Handling false positives
We will need a way to handle false positives and still allow that content to be pushed to GitLab. While we can strive to reduce the frequency, it will happen on occasion and we need an "escape hatch" to allow work to continue.
Proposal:
- Commit message flag to indicate secret scanning should be skipped/ignored - CI example: [ci skip] (https://docs.gitlab.com/ee/ci/pipelines/#skip-a-pipeline)
- Push option to skip secret scanning - CI example: the ci.skip push option
Note that the post-push secret scanning will still run, even if the pre-receive secret scanning is skipped.
Git push response on a detected secret
We should return an error message to the user, including the file, line, and position where the secret was found.
We should also state that, if the detection is a false positive, the user can use the methods above to skip secret scanning.
We support both of these options today for CI.
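A sketch of what such a rejection message could look like from the pre-receive check; the function, wording, and "remote:" framing are illustrative placeholders, not final UX.

```go
package main

import "fmt"

// rejectionMessage formats the pre-receive error returned to the pusher
// when a secret is found, including file, line, and column, plus a
// pointer to the skip mechanisms and a note that post-push scanning
// still runs. Exact wording is illustrative.
func rejectionMessage(file string, line, col int, rule string) string {
	return fmt.Sprintf(
		"remote: PUSH BLOCKED: possible secret (%s) detected in %s:%d:%d.\n"+
			"remote: If this is a false positive, re-push using the skip push option\n"+
			"remote: or the skip commit-message flag; post-push secret scanning will still run.",
		rule, file, line, col)
}

func main() {
	fmt.Println(rejectionMessage("config/secrets.yml", 12, 9, "gitlab_personal_access_token"))
}
```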
Feature control
This feature should be a setting which can be controlled at the group and project level.
Layered approach
Due to performance concerns of scanning arbitrary large files/commits during push operations, we will need to limit the pre-receive scanner to certain maximum file sizes and file types.
While most secrets tend to be in smaller text-based files, there is the potential that some may be present in larger or binary files.
To solve this, we would still recommend running the existing CI-based job or the upcoming continuous secret scanning feature: &8667
Rationale
The primary reasons for selecting this approach are:
- Preventing secrets from being pushed is preferable to cleaning up afterwards
- Purely client-side scanning solutions are hard to enforce and manage
- Git changes are atomic, so if we are to reject a change, it must happen prior to accepting the change
Estimated effort and timeline
TBD
Other discussed solutions
There are a few other solutions that have been discussed:
- Requiring remote-only development. This way the enterprise can control the environment and require scanning.
- Bundle and require client-side pre-commit hook scanning. Perhaps bundle with the glab CLI (both glab and gitleaks are golang), and find some way to require it.
Remote only development
Provide a way for enterprises to allow only remote development as a way to commit changes.
Pros:
- Allows full control over the dev environment
Cons:
- Remote development is early at GitLab, with low maturity especially around troubleshooting and other debugging scenarios
- Requires complete buy-in from the organization and its developers. Some may not want to leave their existing ways of working client-side.
- Needs some way of blocking other commits from other sources
Require client-side scanning to commit
Bundle secret scanner into CLI and require scanned commits only
Pros:
- Client side solution without server side performance impact
Cons:
- Requires client-side trust, which is unlikely in enterprises unless we can find some way of strongly certifying that the commit was scanned
- Additional dependency to install for end users on their machines
- We would also need to embed this in all server-side editing options like the single file editor, multi-file editor, and so on
Considerations
- For remote-only development, how many companies would this serve? Is remote development ready for this to be the only mechanism of development?
- For client-side pre-commit hooks, how can an enterprise enforce this to occur when users may have admin control of the device?
- Potentially https://github.com/finos/git-proxy
- For official server-side hooks, is it possible to scan with a performance and load that is light enough to be workable at scale?