Infer vulnerability report type from SARIF identifiers (CVE/CWE) during ingestion
## Summary

Implement identifier-based inference of `report_type` for SARIF findings during ingestion, replacing the current generic `:sarif` report type with the appropriate security category (`:sast`, `:dependency_scanning`, `:secret_detection`, `:container_scanning`).

This is the implementation issue derived from the validated proposal in https://gitlab.com/gitlab-org/gitlab/-/work_items/596949#proposal and the discussion in https://gitlab.com/gitlab-org/gitlab/-/work_items/452042#note_3257828537.

## Background

SARIF is a scanner-agnostic format. Currently, all SARIF findings are ingested under a single `:sarif` report type. This isolates SARIF results from type-specific findings in filtering, security policies, UUID generation, and auto-resolution, since these features all key on `scan_type`.

The research in #596949 validated that **96.6% of SARIF findings carry sufficient identifiers** (CVE/CWE) to infer the correct report type, making identifier-based inference viable.

This work is blocked on !230154, which provides the multi-scan infrastructure (per-run `Security::Report` fan-out and `scanner_external_id` on `security_scans`) that this inference logic builds on.

## Proposal

Implement a hybrid inference strategy: identifier-based classification with tool-name fallback.

### Inference Priority

```ruby
SECRET_CWES = %w[CWE-798 CWE-259 CWE-321 CWE-522].freeze

def infer_report_type_from_result(result, scanner_name)
  identifiers = extract_identifiers(result)

  # Priority 1: Identifier-based
  if identifiers.any? { |i| i[:type] == 'cve' }
    :dependency_scanning
  elsif has_image_location?(result)
    :container_scanning
  elsif identifiers.any? { |i| i[:type] == 'cwe' && SECRET_CWES.include?(i[:id]) }
    :secret_detection
  elsif identifiers.any? { |i| i[:type] == 'cwe' }
    :sast
  # Priority 2: Default
  else
    :sast # SARIF = Static Analysis by default
  end
end
```

### Three Validated Identifier Extraction Patterns

1. **Pattern A: CVE from `ruleId`** (Trivy, Dependency-Check)
   - `ruleId` contains `CVE-YYYY-NNNNN`
2. **Pattern B: CWE from `properties.tags[]`** (Semgrep, Bandit)
   - tags contain `CWE-NNN: Description`
3. **Pattern C: CWE from `rule.relationships[]`** (Flawfinder, SpotBugs)
   - relationships reference the CWE taxonomy

### Integration Point

The inference logic should be added to the SARIF parser's `process_run` method. With !230154 merged, each SARIF `run[]` produces its own `Security::Report`. The inference determines the `report_type` for each report based on the findings within that run.

```ruby
# Pseudocode: `sarif_results`, `scanner_name`, and `create_report` are placeholders.
def parse!(json_data, report)
  # Group results by inferred type
  grouped_results = sarif_results.group_by do |result|
    infer_report_type_from_result(result, scanner_name)
  end

  # Create one Security::Report per type
  grouped_results.map do |report_type, results|
    create_report(type: report_type, results: results)
  end
end
```

### Edge Cases

Hardcoding tool names feels too hacky, so we keep them relatively limited. After the initial rollout, we can gather enough data to feel more confident in the approach across all findings, rather than only SARIF-specific ones.

Given that ~"category:secret detection" does not generally include CWEs, we could hardcode those per the [inference priority proposal](#inference-priority) above, but I'm inclined to avoid doing so immediately. Instead, we default to `:sast`, on the assumption that SARIF is currently used _primarily_ for SAST, and we note this fallback behavior in the ~documentation (see the [documentation issue](https://gitlab.com/gitlab-org/gitlab/-/work_items/599284)).

## Acceptance Criteria

> When ingesting a single Semgrep SAST artifact containing both valid SAST findings and hardcoded credentials, we produce findings of both SAST and SD types.

- [x] SARIF findings with CVE identifiers are ingested as `:dependency_scanning`
- [ ] SARIF findings with image locations are ingested as `:container_scanning`
- [x] SARIF findings with secret-related CWEs (798, 259, 321, 522) are ingested as `:secret_detection`
- [x] SARIF findings with other CWEs are ingested as `:sast`
- [x] Findings with no identifiers default to `:sast`
- [x] All three identifier extraction patterns (CVE from ruleId, CWE from tags, CWE from relationships) are supported
- [x] Inferred report types appear correctly in vulnerability dashboard filters, MR widget, and Pipeline Security tab
- [x] Existing UUID generation and deduplication work correctly with inferred types
- [x] Unit tests cover all inference paths and edge cases
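
## Identifier Extraction Sketch

The proposal calls `extract_identifiers` without defining it. The following is a minimal sketch of how the three extraction patterns (CVE from `ruleId`, CWE from `properties.tags[]`, CWE from `rule.relationships[]`) might be combined. It is illustrative only, not the parser's actual implementation. Assumptions: the result and its rule descriptor arrive as plain string-keyed hashes, the optional `rule` argument is hypothetical (the proposal's call passes only `result`), and a relationship target may carry either a bare number (`327`) or a prefixed id (`CWE-327`).

```ruby
# Illustrative sketch only. Hash shapes and the `rule` parameter are
# assumptions, not the GitLab SARIF parser's real interface.
CVE_PATTERN = /\bCVE-\d{4}-\d{4,}\b/i
CWE_PATTERN = /\bCWE-\d+\b/i

def extract_identifiers(result, rule = {})
  identifiers = []

  # Pattern A: CVE embedded in the result's ruleId (e.g. Trivy).
  result['ruleId'].to_s.scan(CVE_PATTERN) do |cve|
    identifiers << { type: 'cve', id: cve.upcase }
  end

  # Pattern B: CWE in properties.tags[], e.g. "CWE-798: Description".
  # Tags are read from both the result and its rule descriptor.
  tags = Array(result.dig('properties', 'tags')) +
         Array(rule.dig('properties', 'tags'))
  tags.each do |tag|
    cwe = tag.to_s[CWE_PATTERN]
    identifiers << { type: 'cwe', id: cwe.upcase } if cwe
  end

  # Pattern C: CWE referenced through rule.relationships[] targets
  # whose tool component is the CWE taxonomy.
  Array(rule['relationships']).each do |relationship|
    target = relationship['target'] || {}
    next unless target.dig('toolComponent', 'name').to_s.casecmp?('CWE')

    raw_id = target['id'].to_s
    if raw_id.match?(/\A\d+\z/)
      identifiers << { type: 'cwe', id: "CWE-#{raw_id}" }
    elsif (cwe = raw_id[CWE_PATTERN])
      identifiers << { type: 'cwe', id: cwe.upcase }
    end
  end

  identifiers.uniq
end

# Example inputs, one per pattern (all values are made up for illustration).
trivy_result  = { 'ruleId' => 'CVE-2023-12345' }
semgrep_rule  = { 'properties' => { 'tags' => ['CWE-798: Use of Hard-coded Credentials', 'security'] } }
spotbugs_rule = { 'relationships' => [{ 'target' => { 'id' => '327', 'toolComponent' => { 'name' => 'CWE' } } }] }

p extract_identifiers(trivy_result)
p extract_identifiers({ 'ruleId' => 'hardcoded-secret' }, semgrep_rule)
p extract_identifiers({ 'ruleId' => 'WEAK_CRYPTO' }, spotbugs_rule)
```

Normalizing every identifier to the `CWE-NNN` / `CVE-YYYY-NNNNN` form keeps the `SECRET_CWES.include?(i[:id])` check in the inference method a plain string comparison, regardless of which pattern produced the identifier.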