Move Secret Detection script logic into the analyzer
Description
Secret Detection is unlike the other analyzers in that it scans git history by inspecting the output of git log -p ... instead of analyzing whole files/projects. Determining which parts of a git history to scan is not a straightforward task. Currently, the vendored template uses this script to prepare the git repo and git ranges.
There are 4 scenarios we should cover to make sure that, if secrets do get pushed, they are caught immediately.
1. Push events
One or more commits can be associated with a push event. For example, if you are working on a feature branch, commit 5 changes to that branch, and then push the branch, the Secret Detection job triggered by the push event should scan only those 5 commits. The commits associated with the push can be retrieved by using CI_COMMIT_BEFORE_SHA and CI_COMMIT_SHA.
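As a rough illustration of the push case, the range could be derived from the two CI variables as below. This is only a sketch in Python, not the analyzer's code: the function name push_commit_range is hypothetical, and returning None to signal "no usable range" is an assumption about how a caller might want to fall back. It relies on GitLab setting CI_COMMIT_BEFORE_SHA to all zeros when there is no previous commit, such as the first push of a new branch.

```python
import os

# GitLab reports an all-zero SHA when there is no previous commit
# (for example, the first push of a new branch).
NULL_SHA = "0" * 40


def push_commit_range(env=os.environ):
    """Return a `git log` revision range for a push event, or None.

    Hypothetical helper: None means the range cannot be derived from the
    CI variables alone and the caller has to pick another strategy.
    """
    before = env.get("CI_COMMIT_BEFORE_SHA", "")
    after = env.get("CI_COMMIT_SHA", "")
    if not after:
        return None
    if not before or before == NULL_SHA:
        return None
    # `A..B` selects the commits reachable from B but not from A.
    return f"{before}..{after}"
```

The resulting string can be passed to git log -p as the revision range.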
2. Merge events
Merge events can have multiple push events associated with them. This means retrieving the list of commits is not as straightforward and requires some form of git log targetBranch..CIBranch to determine the commit range.
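A minimal sketch of how the merge case might assemble that command, in Python for illustration only. The function name merge_log_command and the branch-name validation are assumptions, not existing analyzer behavior. Where the two branch names come from is left to the caller.

```python
def merge_log_command(target_branch, ci_branch):
    """Build the argv for `git log -p targetBranch..CIBranch`.

    Hypothetical helper: returns the argument list instead of running
    git, so the caller decides how to execute it and handle failures.
    """
    for name in (target_branch, ci_branch):
        # Reject empty names and names that git could read as an option.
        if not name or name.startswith("-"):
            raise ValueError(f"invalid branch name: {name!r}")
    # `target..ci` selects commits on the CI branch that are not on target.
    return ["git", "log", "-p", f"{target_branch}..{ci_branch}"]
```

Returning an argv list rather than a shell string avoids quoting problems with unusual branch names.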
3. Full History Scans
Some users may want to scan the entire history of their repo on all branches. This is gitleaks' default behavior and is supported by the SECRET_DETECTION_HISTORIC_SCAN variable.
4. NoGit Scans
Users may want to scan the current state of their repo for secrets. This means treating repos as plain directories and scanning the contents of files for secrets. This is the current default behavior if no options are passed to the secret detection scanner.
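To make the four scenarios concrete, here is one way the analyzer could choose between them once the logic lives in one place. This is a hedged sketch in Python: the ScanMode names, the select_scan_mode function, the commit_range parameter, and the precedence order are all assumptions. It also assumes SECRET_DETECTION_HISTORIC_SCAN is enabled with the value "true". Push and merge events are folded into a single mode because both reduce to scanning a commit range.

```python
from enum import Enum


class ScanMode(Enum):
    COMMIT_RANGE = "commit-range"  # push and merge events
    FULL_HISTORY = "full-history"  # SECRET_DETECTION_HISTORIC_SCAN
    NO_GIT = "no-git"              # scan files as a plain directory


def select_scan_mode(env, commit_range=None):
    """Pick a scan mode (hypothetical precedence, for illustration).

    `commit_range` is whatever range the caller derived for a push or
    merge event, or None if no range could be determined.
    """
    # Assumption: the variable is switched on with the literal "true".
    if env.get("SECRET_DETECTION_HISTORIC_SCAN", "").lower() == "true":
        return ScanMode.FULL_HISTORY
    if commit_range:
        return ScanMode.COMMIT_RANGE
    # No options passed: fall back to scanning the current file contents.
    return ScanMode.NO_GIT
```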
The git depth is set to 20 for shallow clones in GitLab pipelines. Secret Detection overrides this and instead fetches the entire history of $CI_DEFAULT_BRANCH and $CI_COMMIT_REF_NAME. This can cause performance issues for large repos, as fetching the full history of a repo is an expensive task. !78321 (merged) introduces changes that make --depth dynamic when fetching. This should give us full coverage, so that all commits being pushed to GitLab are scanned, and it should greatly reduce the time of Secret Detection jobs for large repos since we only fetch what we need.
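One generic way to fetch only what is needed is to deepen the shallow clone in steps until the boundary commit is available locally. The sketch below illustrates that idea in Python. It is not a description of what !78321 does, and the function name fetch_until_present, the step and max_depth defaults, and the origin remote name are all assumptions. It uses git cat-file -e to test for the commit and git fetch --deepen to extend the history.

```python
import subprocess


def _run(argv):
    """Run a command and return its exit code."""
    return subprocess.run(argv, capture_output=True).returncode


def fetch_until_present(sha, ref, step=50, max_depth=1000, run=_run):
    """Deepen a shallow clone until commit `sha` exists locally.

    Hypothetical helper. Returns True once the commit is available, and
    False if a fetch fails or `max_depth` extra commits were requested
    without finding it. `run` is injectable so the loop can be tested
    without a real repository.
    """
    requested = 0
    # `git cat-file -e <sha>^{commit}` exits 0 only if the commit exists.
    while run(["git", "cat-file", "-e", f"{sha}^{{commit}}"]) != 0:
        if requested >= max_depth:
            return False
        # Assumption: the remote is named "origin".
        if run(["git", "fetch", f"--deepen={step}", "origin", ref]) != 0:
            return False
        requested += step
    return True
```

A False result would be the point to fall back to a full fetch or to report a clear error.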
@plafoucriere and @dsearles brought up some good points when reviewing gitlab-com/www-gitlab-com!96577 (closed), saying we should move the script logic from the vendored template to the Secret Detection analyzer. I think this would be a worthwhile change for the following reasons:
- Error handling: if an error pops up during a git fetch or git log, we can handle the error and log a useful message, which would help with maintaining and debugging the analyzer. This would also help customers and support.
- All Secret Detection logic tied to one release operation: instead of updating the Secret Detection analyzer AND the vendored template, we would just have to update the analyzer. This would eliminate the dance between vendored template changes and analyzer releases that we are currently doing.
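To illustrate the error-handling point, a wrapper along these lines could turn a failed git fetch or git log into a message that names the command, its exit status, and git's own stderr. This is a sketch in Python under assumed names: GitCommandError and run_git are hypothetical, and the message format is just one option.

```python
import subprocess


class GitCommandError(RuntimeError):
    """Raised when a git invocation cannot be run or exits non-zero."""


def run_git(args, cwd=None, git="git"):
    """Run `git <args>` and return stdout, or raise GitCommandError.

    Hypothetical helper: `git` is overridable so the executable path can
    be configured (or replaced in tests).
    """
    argv = [git, *args]
    try:
        proc = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        raise GitCommandError(f"could not execute {git!r}: {exc}") from exc
    if proc.returncode != 0:
        # Include the command, status, and stderr so logs are actionable.
        raise GitCommandError(
            f"`{' '.join(argv)}` exited with status {proc.returncode}: "
            f"{proc.stderr.strip()}"
        )
    return proc.stdout
```

A caller could catch GitCommandError around the fetch and log steps and log the message before exiting, instead of letting a bare shell failure end the job.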
Proposal
Move the vendored template script logic into the Secret Detection analyzer.