Infer vulnerability report type from SARIF identifiers (CVE/CWE) during ingestion
## Summary
Implement identifier-based inference of `report_type` for SARIF findings during ingestion, replacing the current generic `:sarif` report type with the appropriate security category (`:sast`, `:dependency_scanning`, `:secret_detection`, `:container_scanning`).
This is the implementation issue derived from the validated proposal in https://gitlab.com/gitlab-org/gitlab/-/work_items/596949#proposal and the discussion in https://gitlab.com/gitlab-org/gitlab/-/work_items/452042#note_3257828537.
## Background
SARIF is a scanner-agnostic format. Currently, all SARIF findings are ingested under a single `:sarif` report type. This isolates SARIF results from type-specific findings in filtering, security policies, UUID generation, and auto-resolution, because each of those features keys on `scan_type`. The research in #596949 validated that **96.6% of SARIF findings carry sufficient identifiers** (CVE/CWE) to infer the correct report type, making identifier-based inference viable.
This work is blocked on !230154, which provides the multi-scan infrastructure (per-run `Security::Report` fan-out and `scanner_external_id` on `security_scans`) that this inference logic builds on.
## Proposal
Implement a hybrid inference strategy: identifier-based classification with tool-name fallback.
### Inference Priority
```ruby
SECRET_CWES = %w[CWE-798 CWE-259 CWE-321 CWE-522].freeze

def infer_report_type_from_result(result, scanner_name)
  identifiers = extract_identifiers(result)

  # Priority 1: Identifier-based
  if identifiers.any? { |i| i[:type] == 'cve' }
    :dependency_scanning
  elsif has_image_location?(result)
    :container_scanning
  elsif identifiers.any? { |i| i[:type] == 'cwe' && SECRET_CWES.include?(i[:id]) }
    :secret_detection
  elsif identifiers.any? { |i| i[:type] == 'cwe' }
    :sast
  # Priority 2: Default
  else
    :sast # SARIF = Static Analysis by default
  end
end
```
### Three Validated Identifier Extraction Patterns
1. **Pattern A: CVE from `ruleId`** (Trivy, Dependency-Check) - `ruleId` contains `CVE-YYYY-NNNNN`
2. **Pattern B: CWE from `properties.tags[]`** (Semgrep, Bandit) - tags contain `CWE-NNN: Description`
3. **Pattern C: CWE from `rule.relationships[]`** (Flawfinder, SpotBugs) - relationships reference CWE taxonomy
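The three patterns above could be combined into a single extraction helper. The following is a minimal sketch, not the parser's actual implementation. It assumes results and rules arrive as plain hashes parsed from SARIF JSON, that the matching rule from `tool.driver.rules[]` is passed in as a hypothetical second argument, and that taxonomy relationships expose `target.toolComponent.name` and `target.id` as in SARIF 2.1.0.

```ruby
CVE_PATTERN = /\bCVE-\d{4}-\d{4,}\b/i
CWE_PATTERN = /\bCWE-(\d+)\b/i

# Assumption: `result` is one entry of a run's `results[]` and `rule` is the
# matching entry of `tool.driver.rules[]` (nil when the run declares no rules).
# Returns hashes shaped like { type: 'cwe', id: 'CWE-89' }.
def extract_identifiers(result, rule = nil)
  identifiers = []

  # Pattern A: CVE embedded in ruleId
  if (cve = result['ruleId'].to_s[CVE_PATTERN])
    identifiers << { type: 'cve', id: cve.upcase }
  end

  # Pattern B: CWE in properties.tags[], checked on both the result and its rule
  tags = [result, rule].compact.flat_map { |node| Array(node.dig('properties', 'tags')) }
  tags.each do |tag|
    tag.to_s.scan(CWE_PATTERN) { |(number)| identifiers << { type: 'cwe', id: "CWE-#{number}" } }
  end

  # Pattern C: CWE referenced through the rule's taxonomy relationships
  Array(rule && rule['relationships']).each do |relationship|
    target = relationship['target'] || {}
    next unless target.dig('toolComponent', 'name').to_s.casecmp?('CWE')

    number = target['id'].to_s[/\d+/]
    identifiers << { type: 'cwe', id: "CWE-#{number}" } if number
  end

  identifiers.uniq
end
```

Normalizing every CWE to the `CWE-NNN` form keeps the output comparable against `SECRET_CWES` regardless of which pattern produced it.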
### Integration Point
The inference logic should be added to the SARIF parser's `process_run` method. With !230154 merged, each SARIF `run[]` produces its own `Security::Report`. The inference determines the `report_type` for each report based on the findings within that run.
```ruby
def parse!(json_data, report)
  # Group results by inferred type
  grouped_results = sarif_results.group_by do |result|
    infer_report_type_from_result(result, scanner_name)
  end

  # Create one Security::Report per type
  grouped_results.map do |report_type, results|
    create_report(type: report_type, results: results)
  end
end
```
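To illustrate how grouping splits a single run into several report types, here is a small self-contained sketch. It is not the real parser: `classify` is a hypothetical, simplified stand-in for `infer_report_type_from_result` that takes already-extracted identifiers, and the result hashes are made-up sample data rather than real SARIF.

```ruby
SECRET_CWES = %w[CWE-798 CWE-259 CWE-321 CWE-522].freeze

# Hypothetical simplified classifier following the proposed priority order.
def classify(identifiers, image_location: false)
  return :dependency_scanning if identifiers.any? { |i| i[:type] == 'cve' }
  return :container_scanning if image_location
  return :secret_detection if identifiers.any? { |i| i[:type] == 'cwe' && SECRET_CWES.include?(i[:id]) }

  :sast # other CWEs and identifier-less results both fall back to SAST
end

# Sample run mixing an injection finding, a hardcoded credential,
# and a result that carries no identifiers at all.
run_results = [
  { rule: 'sql-injection', identifiers: [{ type: 'cwe', id: 'CWE-89' }] },
  { rule: 'hardcoded-password', identifiers: [{ type: 'cwe', id: 'CWE-798' }] },
  { rule: 'style-nit', identifiers: [] }
]

grouped = run_results.group_by { |result| classify(result[:identifiers]) }

grouped.each do |type, results|
  puts "#{type}: #{results.map { |r| r[:rule] }.join(', ')}"
end
# prints:
#   sast: sql-injection, style-nit
#   secret_detection: hardcoded-password
```

Because the CVE check runs first, a result carrying both a CVE and a secret-related CWE would land in `:dependency_scanning` under this ordering.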
### Edge Cases
Hardcoding tool names feels too hacky, so we keep them relatively limited. After the initial rollout, we can gather enough data to feel more confident in the approach across all findings, rather than only SARIF-specific ones.
Because ~"category:secret detection" findings do not generally include CWEs, we could hardcode them as in the [inference priority proposal](#inference-priority) above, but I'm inclined to avoid doing so immediately.
Instead, we default to `:sast` under the assumption that SARIF is currently used _primarily_ for SAST, and we note this fallback behavior in the ~documentation (see the [documentation issue](https://gitlab.com/gitlab-org/gitlab/-/work_items/599284)).
## Acceptance Criteria
> When ingesting a single Semgrep SAST artifact containing both valid SAST findings and hardcoded credentials, we produce findings of both the SAST and secret detection types.
- [x] SARIF findings with CVE identifiers are ingested as `:dependency_scanning`
- [ ] SARIF findings with image locations are ingested as `:container_scanning`
- [x] SARIF findings with secret-related CWEs (798, 259, 321, 522) are ingested as `:secret_detection`
- [x] SARIF findings with other CWEs are ingested as `:sast`
- [x] Findings with no identifiers default to `:sast`
- [x] All three identifier extraction patterns (CVE from ruleId, CWE from tags, CWE from relationships) are supported
- [x] Inferred report types appear correctly in vulnerability dashboard filters, MR widget, and Pipeline Security tab
- [x] Existing UUID generation and deduplication work correctly with inferred types
- [x] Unit tests cover all inference paths and edge cases