Secret Detection false positive testing
Problem to solve
Changes made to the regex patterns for the secrets analyser can have an adverse effect on the quality of the findings. In particular, patterns that are too broad can result in a high rate of false positive findings. This degrades the user experience by creating unnecessary noise and reducing confidence in the secrets analyser.
Proposal
Develop a tool which can quickly scan a large number of popular repositories and consolidate findings into a single report. The tool can be run when changes are made to regex patterns, so the author can get a feel for the efficacy of their changes.
Note that this is not benchmarking because we won't be establishing a baseline for comparison.
Option 1: Mass Auto
Mass Auto is a tool developed by the Vulnerability Research group for concurrently performing batch work. It uses the AWS CDK to dynamically spin up compute resources on AWS according to user-supplied `work.json` and `work.tmpl` files. These files specify the data associated with each job, as well as an arbitrary shell script to execute for each job. Mass Auto executes entirely in a CI job and outputs results into a persistent S3 bucket. All compute resources are spun down at the conclusion of the job.
I created a fork of Mass Auto and configured it to run the secrets analyser. `work.json` contains the top 1000 GitHub repositories with >= 500 stars, sorted by size, and the `work.tmpl` script simply clones each repository and executes the analyser.
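As an illustration of how such a repository list could be assembled: the GitHub search API can filter by star count but cannot sort results by repository size server-side, so the sort has to happen client-side. This is a hypothetical sketch, not the fork's actual code; the `Repo` struct and `buildSearchURL` helper are my own names.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// Repo mirrors just the fields we need from the GitHub search API response.
type Repo struct {
	FullName string `json:"full_name"`
	Stars    int    `json:"stargazers_count"`
	Size     int    `json:"size"` // kilobytes, as reported by GitHub
}

// buildSearchURL returns one page of the search query for repos with
// >= 500 stars. The search API caps results at 1000 items, which is
// conveniently the size of the list we want.
func buildSearchURL(page int) string {
	q := url.Values{}
	q.Set("q", "stars:>=500")
	q.Set("per_page", "100")
	q.Set("page", fmt.Sprint(page))
	return "https://api.github.com/search/repositories?" + q.Encode()
}

// sortBySize orders repositories largest-first, since GitHub search
// cannot do this server-side.
func sortBySize(repos []Repo) {
	sort.Slice(repos, func(i, j int) bool { return repos[i].Size > repos[j].Size })
}

func main() {
	// In real use, each page URL would be fetched and the JSON decoded
	// into []Repo; static sample data keeps the sketch self-contained.
	repos := []Repo{
		{FullName: "a/small", Size: 10},
		{FullName: "b/big", Size: 5000},
		{FullName: "c/mid", Size: 300},
	}
	sortBySize(repos)
	for _, r := range repos {
		fmt.Println(r.FullName)
	}
}
```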
With this configuration, the pipeline job took 49 minutes to run. This includes the time taken to concurrently clone each repository, run the scan, and upload the report to S3. The pipeline job itself failed (the cause hasn't been diagnosed yet), but the reports were correctly saved to S3.
There appears to be some sort of bottleneck that prevents the `desiredCount` number of containers from running concurrently.
Option 2: AWS Batch/GCP Batch
Use the AWS Batch or GCP Batch service. We might be able to get better performance since these are purpose-built services.
Option 3: Run locally on an ad-hoc basis
Clone repositories and concurrently run secret detection scans locally. This has the benefit of being able to persist cloned repositories for future scans (saving heaps of time and bandwidth) and refresh them as necessary.
I drafted some very rough Go code that exposes a CLI for cloning repositories, executing scans (with configurable concurrency), and generating an HTML summary report. Most of the time is spent cloning repos; scanning ~1000 repos takes only around 10 minutes, and generating the summary report from the individual `gl-secret-detection-report.json` files takes only a second or two.
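The aggregation step is straightforward because each scan emits structured JSON. A sketch of how per-rule finding counts could be tallied across reports, assuming the top-level `vulnerabilities` array from the GitLab security report schema (only the fields used here are modelled; check the actual reports before relying on this shape):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// report models just the slice of gl-secret-detection-report.json we
// need; field names follow the GitLab security report schema.
type report struct {
	Vulnerabilities []struct {
		Name     string `json:"name"`
		Location struct {
			File string `json:"file"`
		} `json:"location"`
	} `json:"vulnerabilities"`
}

// tally adds one report's findings to a running per-rule-name count.
func tally(raw []byte, counts map[string]int) error {
	var r report
	if err := json.Unmarshal(raw, &r); err != nil {
		return err
	}
	for _, v := range r.Vulnerabilities {
		counts[v.Name]++
	}
	return nil
}

func main() {
	counts := map[string]int{}
	for _, path := range os.Args[1:] { // one report file per scanned repo
		raw, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if err := tally(raw, counts); err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", path, err)
		}
	}
	for name, n := range counts {
		fmt.Printf("%6d  %s\n", n, name)
	}
}
```

Rule names with suspiciously high counts across many unrelated repositories are the natural starting point when hunting for an over-broad pattern.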
The benefits of this approach are that it's relatively simple and that execution happens locally, so we're not spending money on cloud resources. The drawbacks are:
- it'll be difficult to get reproducible results between team members' machines because we'd likely clone different revisions of each repo
- no integration with CI