# Secret Detection Analyzer Git Fetching Improvements

## Background and Problem Statement

The GitLab Secret Detection Analyzer has a critical issue with its git repository fetching mechanism. Currently, repositories are fetched twice: once by the GitLab Runner and again by the Analyzer. While the Runner respects the `GIT_DEPTH` setting, the Analyzer ignores it and uses a maximum depth value (`maxGitDepth = 2147483647`), making the Runner's fetch redundant.

This is especially problematic for large repositories, where the Analyzer unnecessarily fetches millions of commits even when running on the default branch, causing:

- Severe performance degradation and frequent job timeouts (see issue `gitlab#510939`)
- Excessive resource consumption on both GitLab infrastructure and customer self-managed instances
- Confusion for users who set reasonable `GIT_DEPTH` values but still experience extremely slow scans
- Inconsistent behavior across different CI pipeline scenarios
- Misleading logs and documentation that don't match actual behavior

This epic aims to implement an intelligent git fetching strategy that improves performance while ensuring proper secret detection across all scenarios.

## Goals

- Implement a smarter git fetching strategy that respects user settings
- Eliminate redundant repository fetches
- Improve performance, especially for large repositories
- Provide clear documentation about expectations for different scan types
- Ensure all necessary commits are available for proper secret scanning

## Non-Goals

- Changing the core secret detection scanning mechanism
- Altering how secrets are identified or reported
- Modifying the GitLab Runner's behavior

## Key Deliverables

1. A new intelligent `gitFetch` function that selects the appropriate fetch strategy based on context
2. Optimized fetching implementation for large repositories that works within git's locking constraints
3. Proper handling of all identified scenarios in the scenario matrix
4. Updated documentation reflecting the new behavior
5. Integration tests validating proper behavior across all scenarios

## Timeline

- Start Date: TBD
- Due Date: TBD

## Child Issues

(Only outlines, Todo: create issues as per code changes)

* [ ] https://gitlab.com/gitlab-org/gitlab/-/issues/530417+s
* [ ] https://gitlab.com/gitlab-org/gitlab/-/issues/530422+s
* [ ] https://gitlab.com/gitlab-org/gitlab/-/issues/530436+s

## Approvers and Stakeholders

- DRI: TBD
- Approvers: TBD
- Stakeholders: Secret Detection team

## Implementation Details

### Fetch Strategy Selection (Updated Implementation)

The implementation will include a strategy selector based on the scenario matrix:

| Scenario | Strategy | Git Operation |
|----------|----------|---------------|
| Not a Git Repository | None | Skip all git operations |
| Historic Scan | FetchAll | `git fetch --all` (respects depth unless `HistoricScan=true`) |
| Default Branch | FetchNone | No fetch needed (directory scan) |
| Custom Log Options | FetchLogOpts | Parses options: `--shallow-since=X` if `--since/--after` found, or `--depth=N` if `-n/--max-count` found |
| Merge Request | FetchRange | `git fetch --shallow-since=<base_commit_date> origin <branch>` |
| Unlimited Depth (`GIT_DEPTH=0`) | FetchAll | `git fetch --all` |
| New Branch (First Commit) | FetchShallow | No fetch needed (scans single commit) |
| New Branch (Later Commits) | FetchShallow | No fetch needed (scans `HEAD^..HEAD`) |
| Branch Pipeline | FetchRange | `git fetch --shallow-since=<base_commit_date> origin <branch>` |
| Force Push Detected | FetchRange | Attempts fetch, falls back to directory scan if base missing |
| Log Options with No Constraints | FetchLogOpts → FetchRange | Falls back to range strategy |
| Date Fetch Failure | FetchRange → FetchShallow | Falls back to single commit scan |

### Optimized Single-Fetch Approach

Due to git's locking mechanism, parallel fetching is not viable. Instead, a single optimized fetch will be implemented with:

- Smart depth detection to fetch only what's needed
- Progress tracking for large fetches
- Clear logging to indicate fetch progress
- Fallback mechanisms if initial fetch strategies fail

### User Configuration

The implementation will respect:

- `GIT_DEPTH` for controlling fetch depth
- A new `SECRET_DETECTION_GIT_FETCH_DEPTH` for override control
- `SECRET_DETECTION_HISTORIC_SCAN` for full repository scanning
- Use of `GIT_STRATEGY: none` in the CI template to prevent redundant Runner fetches

## References

- Issue https://gitlab.com/gitlab-org/gitlab/-/issues/510939+s - Timeouts in large repositories
- [Proposed implementation](https://gitlab.com/groups/gitlab-org/-/epics/12034#note_2413186689) by Aditya Tiwari

## Related Links

- [GitLab Secret Detection Documentation](https://docs.gitlab.com/ee/user/application_security/secret_detection/)
- Link to design doc (TBD)

## Health Status

- Initial Draft

### Release notes
epic
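
### Appendix: Strategy Selection Sketch

The scenario matrix under "Fetch Strategy Selection" could be expressed roughly as the selector below. This is a minimal Go sketch under stated assumptions, not the analyzer's actual code: the `Strategy` type, the `ScanContext` struct and its fields, and the functions `SelectStrategy` and `FetchArgsFromLogOpts` are hypothetical names introduced for illustration. It also assumes that the rows of the matrix are evaluated top to bottom with the first match winning, which the epic does not state explicitly. Only the `git fetch` flags `--depth` and `--shallow-since` come from the matrix itself.

```go
package main

import (
	"strconv"
	"strings"
)

// Strategy names mirror the scenario matrix; the type itself is hypothetical.
type Strategy string

const (
	FetchSkip    Strategy = "None" // not a git repository: skip all git operations
	FetchAll     Strategy = "FetchAll"
	FetchNone    Strategy = "FetchNone"
	FetchLogOpts Strategy = "FetchLogOpts"
	FetchRange   Strategy = "FetchRange"
	FetchShallow Strategy = "FetchShallow"
)

// ScanContext is a hypothetical summary of the CI job environment.
type ScanContext struct {
	IsGitRepo       bool
	HistoricScan    bool   // SECRET_DETECTION_HISTORIC_SCAN is enabled
	OnDefaultBranch bool
	LogOpts         string // custom git log options, empty if unset
	IsMergeRequest  bool
	UnlimitedDepth  bool // GIT_DEPTH=0
	IsNewBranch     bool
}

// SelectStrategy checks the matrix rows in order; the first match wins.
// Treating row order as precedence is an assumption of this sketch.
func SelectStrategy(c ScanContext) Strategy {
	switch {
	case !c.IsGitRepo:
		return FetchSkip
	case c.HistoricScan:
		return FetchAll
	case c.OnDefaultBranch:
		return FetchNone
	case c.LogOpts != "":
		// Log options without a date or count constraint fall back to FetchRange.
		if len(FetchArgsFromLogOpts(c.LogOpts)) == 0 {
			return FetchRange
		}
		return FetchLogOpts
	case c.IsMergeRequest:
		return FetchRange
	case c.UnlimitedDepth:
		return FetchAll
	case c.IsNewBranch:
		return FetchShallow
	default:
		return FetchRange // regular branch pipeline
	}
}

// FetchArgsFromLogOpts maps git log constraints to git fetch flags:
// --since/--after become --shallow-since, otherwise -n/--max-count become --depth.
// Only the "--flag=value" and "-n value" spellings are handled, and values
// containing spaces are not supported in this simplified parser.
func FetchArgsFromLogOpts(opts string) []string {
	var since, depth string
	fields := strings.Fields(opts)
	for i, f := range fields {
		switch {
		case strings.HasPrefix(f, "--since="):
			since = strings.TrimPrefix(f, "--since=")
		case strings.HasPrefix(f, "--after="):
			since = strings.TrimPrefix(f, "--after=")
		case strings.HasPrefix(f, "--max-count="):
			depth = positiveInt(strings.TrimPrefix(f, "--max-count="))
		case f == "-n" && i+1 < len(fields):
			depth = positiveInt(fields[i+1])
		}
	}
	switch {
	case since != "":
		return []string{"--shallow-since=" + since}
	case depth != "":
		return []string{"--depth=" + depth}
	}
	return nil
}

// positiveInt returns s normalized as a decimal string, or "" if s is not a positive integer.
func positiveInt(s string) string {
	n, err := strconv.Atoi(s)
	if err != nil || n <= 0 {
		return ""
	}
	return strconv.Itoa(n)
}
```

For example, `SelectStrategy(ScanContext{IsGitRepo: true, LogOpts: "--no-merges"})` yields `FetchRange`, because the options carry neither a date nor a count constraint, which corresponds to the "Log Options with No Constraints" row. Runtime fallbacks that depend on the outcome of a fetch (force push with a missing base, date fetch failure) are deliberately left out, since they can only be decided after a fetch attempt.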