Currently, GitLab prevents pushing secrets to a repository by matching known file names and preventing them from being committed. However, authors often include secrets in files that do not match these patterns and end up leaking secrets from their public projects.
Further information
Projects like shhgit find committed secrets and sensitive files across GitHub, Gists, GitLab, and BitBucket, or your local repositories, in real time. From it, we can see that Google and AWS keys are among the most leaked.
GitHub currently does secret scanning whereby commit contents are scanned and matching keys are sent to the vendor and then decommissioned.
Proposal
For an MVC, a boring solution would be to obtain known regexes from top vendors and match them at commit-time. This approach has some performance and resource considerations, though.
Here are example regexes from a popular cloud vendor like AWS (source):
Access Key: (?<![A-Z0-9])[A-Z0-9]{20}(?![A-Z0-9]). In English, this regular expression says: Find me 20-character, uppercase, alphanumeric strings that don’t have any uppercase, alphanumeric characters immediately before or after.
Secret Access Keys: (?<![A-Za-z0-9/+=])[A-Za-z0-9/+=]{40}(?![A-Za-z0-9/+=]). In English, this regular expression says: Find me 40-character, base64 strings that don’t have any base64 characters immediately before or after.
@danielgruesso we'd probably want this to be part of the update git hook? I don't think it could be part of pre-receive (we don't have the data yet), and post-receive is too late to prevent the push.
Since we'd want the configuration to be gitlab-side, we'd need to add a new internal API endpoint analogous to /pre_receive and /post_receive for the update hook to consume (the hooks are part of Gitaly now). That API endpoint could either do the checks itself, calling out to gitaly to receive the data it needs, or return the configuration for Gitaly to do the check.
The latter is more efficient since we don't need to stream the diff to gitlab, and we don't need to keep the HTTP connection open while the check is performed. It does complicate the implementation a little though, since the configuration and the application of it are split. Overall, I think I prefer it quite strongly to the former.
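To make the latter option a bit more concrete, here is a rough sketch (in Go, since the hooks now live in Gitaly) of an update hook fetching detection rules from a hypothetical internal endpoint and applying them locally. The endpoint path, payload shape, and helper names are illustrative only, not an agreed design.

```go
// Hypothetical sketch: a Gitaly-side `update` hook that pulls detection rules
// from a GitLab internal API endpoint and applies them locally, so the diff
// never has to be streamed back to GitLab. Endpoint path, payload shape, and
// helper names are illustrative, not an agreed design.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"os/exec"
	"regexp"
)

// Rule mirrors one entry of the hypothetical configuration payload.
type Rule struct {
	Name    string `json:"name"`
	Pattern string `json:"pattern"` // must be RE2-compatible (no backtracking)
}

func fetchRules(gitlabURL string) ([]Rule, error) {
	// Hypothetical endpoint, analogous to the existing /internal/pre_receive.
	resp, err := http.Get(gitlabURL + "/api/v4/internal/secret_detection_rules")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var rules []Rule
	return rules, json.NewDecoder(resp.Body).Decode(&rules)
}

// checkRange scans the pushed range entirely on the Gitaly node; the update
// hook runs with the repository as its working directory.
func checkRange(oldRev, newRev string, rules []Rule) error {
	out, err := exec.Command("git", "diff", oldRev, newRev).Output()
	if err != nil {
		return err
	}
	for _, r := range rules {
		re, err := regexp.Compile(r.Pattern) // stdlib regexp is RE2 under the hood
		if err != nil {
			continue // skip malformed rules rather than failing every push
		}
		if re.Match(out) {
			return fmt.Errorf("push rejected: possible %s detected", r.Name)
		}
	}
	return nil
}

func main() {
	// Git invokes the update hook as: update <refname> <old-sha> <new-sha>.
	oldRev, newRev := os.Args[2], os.Args[3]
	rules, err := fetchRules("http://gitlab.internal.example") // placeholder URL
	if err == nil {
		err = checkRange(oldRev, newRev, rules)
	}
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```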
There's definitely a little gitaly work in adding the new hook, and a little gitlab work in adding the new API endpoint, whichever way we go.
If we're going to expose this capability as regular expressions, then it's worth noting that in either case, we simply won't be able to permit backtracking, e.g. in GitLab we'd be using Gitlab::UntrustedRegexp, and in Gitaly, the stdlib regexp library, both of which use re2 under the hood. Both of @tmccaslin's examples make use of lookarounds ((?<! and (?!), which require backtracking; permitting that in this feature would be a serious resource utilisation / DoS vector.
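As a quick illustration of that constraint (assuming the check runs on Go's stdlib regexp, which is RE2-based): the lookaround version simply fails to compile, and any rewrite has to approximate the boundary conditions another way. The rewritten pattern below is only an approximation of the AWS rule, not an official one.

```go
// Illustration only: RE2 (Go stdlib regexp, Gitlab::UntrustedRegexp) rejects
// lookaround, so vendor patterns would need an RE2-compatible rewrite.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The lookbehind/lookahead version from the AWS example fails to compile:
	_, err := regexp.Compile(`(?<![A-Z0-9])[A-Z0-9]{20}(?![A-Z0-9])`)
	fmt.Println("lookaround version:", err) // "invalid or unsupported Perl syntax"

	// Approximation without lookaround: anchor on a non-alphanumeric boundary
	// (or the string edge) and capture the 20-character candidate.
	re := regexp.MustCompile(`(?:^|[^A-Z0-9])([A-Z0-9]{20})(?:[^A-Z0-9]|$)`)
	m := re.FindStringSubmatch("token=AKIAIOSFODNN7EXAMPLE;") // AWS's documented example key ID
	if m != nil {
		fmt.Println("candidate access key:", m[1])
	}
}
```

One behavioural difference worth noting: the boundary character is consumed by the rewrite (unlike a lookaround), so adjacent or overlapping candidates need slightly more careful handling.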
> obtain known regexes from top vendors and match them at commit-time.
GitLab has no way to stop you from making commits; Git is still a distributed system. GitLab could only scan before, during, or after sending data to our servers.
> GitHub currently does secret scanning whereby commit contents are scanned and matching keys are sent to the vendor and then decommissioned.
I think you're confused about what GitHub is providing. Basically, they post-process each push, like GitLab is doing too; they just add more checks than GitLab currently does. GitHub doesn't prevent you from pushing; in other words, the data is leaked regardless, they're just very mature about handling this situation. This approach is sane and better than using hooks. Hooks should remain very quick, and running, say, 50 regexps on the commit messages, tree paths, and blobs seems like a bad way to go about it. The problem is that with hooks you'd basically halt progress for the end user for something that only provides value in 1 in a million pushes.
@tmccaslin performance-wise it would indeed be better to use a similar approach, where all the processing is done post-commit and keys sent to vendors.
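For illustration, the non-blocking flavour of that could look roughly like the sketch below: the scan runs from a background job after the push has already been accepted, so the user never waits on it. The placeholders and the reporting step are illustrative, not a description of GitHub's or GitLab's actual implementation.

```go
// Rough sketch of asynchronous post-push scanning; intended to run from a
// background job queued on post-receive, never in the push's critical path.
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strings"
)

// RE2-compatible approximation of an AWS access key ID (not an official rule).
var awsKey = regexp.MustCompile(`(?:^|[^A-Z0-9])([A-Z0-9]{20})(?:[^A-Z0-9]|$)`)

// scanPush inspects only the commits introduced by a single push.
func scanPush(repoPath, oldRev, newRev string) ([]string, error) {
	out, err := exec.Command("git", "-C", repoPath, "rev-list", oldRev+".."+newRev).Output()
	if err != nil {
		return nil, err
	}
	var findings []string
	for _, sha := range strings.Fields(string(out)) {
		patch, err := exec.Command("git", "-C", repoPath, "show", "--unified=0", sha).Output()
		if err != nil {
			continue
		}
		if m := awsKey.FindSubmatch(patch); m != nil {
			findings = append(findings, fmt.Sprintf("%s: possible AWS key %s", sha[:8], m[1]))
		}
	}
	return findings, nil
}

func main() {
	// OLD_SHA / NEW_SHA are placeholders; a real job would receive them from
	// the post-receive event.
	findings, err := scanPush(".", "OLD_SHA", "NEW_SHA")
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	for _, f := range findings {
		fmt.Println(f) // here we'd notify the project and/or the key's vendor
	}
}
```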
We already have Category:Secret Detection, and a secrets analyzer built on top of gitleaks for this purpose, albeit within a CI pipeline context. If we're going to shift secret detection left like this, I would really like to see this done as a unified solution that leverages the same analyzer. As the beginnings of a proposal, I'd honestly like to see and do the following in this area:
Update the secrets analyzer to provide a git-native interface as well as an interface to the CI pipelines.
Finalize the custom rules work for the secrets analyzer.
Author a formalized ruleset that is specific to the use cases we'd like to catch upon push to a remote.
Bundle the secrets analyzer so it ships with GitLab omnibus.
Make the secrets analyzer available so that it's callable from Gitaly nodes, possibly as a persistent container.
Create or update a GitLab-bundled pre-receive hook which includes the secrets analyzer in its execution (a rough sketch follows below).
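As a rough sketch of that last step, a bundled pre-receive hook could simply shell out to the gitleaks-based analyzer for each pushed range, along these lines. The gitleaks flags shown are from its v8 CLI and may differ between versions; the performance concerns raised elsewhere in this thread would still need to be addressed.

```go
// Sketch only: a pre-receive hook that delegates scanning to gitleaks.
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// pre-receive gets one "<old-sha> <new-sha> <refname>" line per ref on stdin.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 3 {
			continue
		}
		oldRev, newRev := fields[0], fields[1]
		// NOTE: newly created refs (old-sha of all zeros) would need special
		// handling; omitted here for brevity.
		cmd := exec.Command("gitleaks", "detect",
			"--source", ".",
			"--log-opts", oldRev+".."+newRev,
			"--no-banner")
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			// gitleaks exits non-zero when it finds leaks (or errors out);
			// a non-zero exit from pre-receive rejects the whole push.
			fmt.Fprintln(os.Stderr, "push rejected: potential secrets detected")
			os.Exit(1)
		}
	}
}
```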
Given the proposal in the comment above, I think there is a way to extend our Secret Detection secure category to power this as a post-commit hook. That way Category:Secret Detection can manage these regexes.
This issue looks like it hasn't seen a lot of activity, so I'm going to grab it into Category:Secret Detection for tracking. We are contemplating an effort similar to the one Thomas outlined above: #246819 (comment 413806514).
I would strongly recommend against adding something like secret detection into server side hooks in Gitaly for performance reasons. For large, busy repositories this can have a disastrous effect on pushes, since each push will need to wait for the secret detection script to run before the refs are updated.
A recent incident involved a customer who did just that: their pre-receive hook was responsible for secret detection.
Thanks @jcaigitlab. We (groupstatic analysis) are definitely being careful about the approach taken here, and we are explicitly not considering any blocking features until analogous detection features have had significant soak time in a non-blocking way. Some of these are outlined in the epic (&8966 (closed)) that I just moved this issue into. I'll also bring this input into the technical discovery issue one of our group members is currently working on (#376716 (closed)).
(For what it's worth, it looks like the customer ticket involved doing a different kind of static analysis—finding a list of banned functions by running a Python script—but the cascading effect of a slow pre-receive hook would be similar to trying to do secret detection in the same place.)
The IBM Cloud Engineering team is asking to add a "pre-commit implementation hook" to the GitLab service, with implementation logic so that 3rd-party secret detection tools (such as the Detect Secrets tool from IBM) can be instantiated by the customer (opt-in per project) to pre-scan code and dependencies before commit. This is a hook for GitLab CE.
Concern about design patterns to limit impact on Gitaly performance.
I know that @connorgilbert is working on this and ideating on this issue. @connorgilbert and @jcaigitlab have discussed the Gitaly concerns below. Have we (Secure Product) hopped on a call with IBM to chat with them about this? I want to make sure that they feel heard and that we understand their full use case, in case we can rig up a workaround while we figure out the best path forward.
Hi @sarahwaldner. I'm from the IBM side. We have engaged in a call with @nbadiey and @begloff. They suggested reaching out to you and the product team to talk about the performance concern and use case. Would you or the team be available for a call tomorrow or this Wednesday?
@yuanchenlu - Apologies for the delay on our end. Please let me know if either of these time slots works for you.
26-June: 1:30pm eastern
27-June: 1:30pm eastern
GitLab team members - I'd like to suggest that this might be implemented for Developer Environments that GitLab centrally configures, including a) gitpod, b) Web IDE, c) devfile dev environments
In these environments we can safely and scalably run pre-commit (pre-server-push) code without impacting Gitaly or the GitLab server, because the compute is distributed and isolated from all GitLab backend processes.
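To illustrate, a client-side pre-commit check of this kind can be very small; it runs entirely inside the developer environment, so no load reaches Gitaly or the GitLab backend. Go is used here only for consistency with the earlier sketches (a real hook could just as well be a shell or Python script), and the pattern is an illustrative approximation rather than a production ruleset.

```go
// Sketch of a client-side pre-commit check run inside the dev environment.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
)

// RE2-compatible approximation of an AWS access key ID (illustrative only).
var awsKey = regexp.MustCompile(`(?:^|[^A-Z0-9])[A-Z0-9]{20}(?:[^A-Z0-9]|$)`)

func main() {
	// Only the staged changes matter for a pre-commit hook.
	diff, err := exec.Command("git", "diff", "--cached", "--unified=0").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "pre-commit: could not read staged diff:", err)
		os.Exit(1)
	}
	if awsKey.Match(diff) {
		fmt.Fprintln(os.Stderr, "pre-commit: possible AWS access key in staged changes; commit aborted")
		os.Exit(1)
	}
}
```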
@connorgilbert Hi there, are there any updates or any progress that I can report to my large GitLab Ultimate customer who has an interest in this requirement?
Hello @nick.thomas @zj-gitlab, I saw you talked about the concerns with leaving the feature in Gitaly. However, I'd like to ask what your alternative approach is. It seems like you are leveraging blocking Sidekiq jobs to perform scans, but do you have code I can take a look at? Thanks.
I'm going to go ahead and close this issue for now as the need is duplicated in other issues/epics. GitLab released secret push protection for Dedicated customers earlier this year, will release a beta on GitLab.com in 17.1, and plans to add availability for self-managed customers in 17.3. Future updates can be tracked in this epic.