Validate SARIF identifier coverage for report type inference
## Validation Complete ✅ Validation of SARIF `identifier` coverage across real-world security tools. **Outcome: Proceed with identifier-based inference using hybrid approach.** --- ## Tools Validated (8 total) ### SAST (5 tools) 1. ✅ **Bandit** (Python) - 100% coverage (29/29) - CWE in `properties.tags[]` 2. ✅ **Semgrep** - 100% coverage (5/5) - CWE in `properties.tags[]` 3. ✅ **Flawfinder** (C/C++) - 100% coverage (2/2) - CWE in `rule.relationships[].target.id` 4. ❌ **Brakeman** (Ruby) - 0% coverage (0/1) - **Upstream bug:** Has CWE but SARIF reporter doesn't export it 5. ✅ **CodeQL** (Code Quality) - 0% coverage (0/1) - Expected (maintainability issues have no CWE) ### Container Scanning (1 tool) 6. ✅ **Trivy** - 100% coverage (18/18) - CVE in `ruleId` ### Dependency Scanning (1 tool) 7. ✅ **OWASP Dependency-Check** - 100% coverage (3/3) - CVE in `ruleId` ### Secret Detection (1 tool) 8. ✅ **Gitleaks** - 0% coverage (0/1) - **By design:** No CWE for modern secret types (API keys, tokens) (see [CWE-based enhancement below](https://gitlab.com/gitlab-org/gitlab/-/work_items/596949#3-cwe-based-secret-detection-optional-enhancement)) ## Coverage Results **By Category:** - ✅ **Dependency Scanning:** 100% (21/21 vulnerabilities) - ✅ **SAST:** 94.7% (36/38 vulnerabilities) - ✅ **Overall:** 96.6% (57/59 vulnerabilities) **Success Criteria:** - ✅ Overall coverage ≥95% - ✅ SAST category ≥90% - ✅ Dependency category ≥95% --- ## Key Findings ### 1. Three Identifier Extraction Patterns Validated **Pattern A: CVE from `ruleId`** (Trivy, Dependency-Check) ```json { "ruleId": "CVE-2021-44228", "message": { "text": "Log4Shell vulnerability" } } ``` **Pattern B: CWE from `properties.tags[]`** (Semgrep, Bandit) ```json { "properties": { "tags": ["security", "CWE-798: Use of Hard-coded Credentials"] } } ``` **Pattern C: CWE from `rule.relationships[]`** (Flawfinder, SpotBugs) ```json { "relationships": [{ "target": { "id": "CWE-119", "toolComponent": { "name": "CWE" } } }] } ``` ### 2. Two Edge Cases Require Tool Name Fallback **Brakeman (SAST):** - **Problem:** SARIF reporter has a bug - doesn't export `cwe_id` field - **Evidence:** Source code shows `warning.cwe_id` exists but isn't included in SARIF output - **Workaround:** Tool name fallback (`scanner.name =~ /brakeman/i → :sast`) - **Long-term:** File issue with Brakeman project to fix SARIF reporter **Gitleaks (Secret Detection):** - **Problem:** No CWE identifiers by design - **Reason:** CWE taxonomy doesn't cover modern secret types (GitHub PATs, AWS keys, API tokens) - **Workaround:** Tool name fallback (`scanner.name =~ /gitleaks|trufflehog/i → :secret_detection`) - **Coverage:** Only 2 secret scanners have native SARIF support (Gitleaks, TruffleHog) ### 3. CWE-Based Secret Detection (Optional Enhancement) **Relevant CWEs for secrets:** - CWE-798: Use of Hard-coded Credentials - CWE-259: Use of Hard-coded Password - CWE-321: Use of Hard-coded Cryptographic Key - CWE-522: Insufficiently Protected Credentials **Recommendation:** Add CWE-based secret detection **before** generic CWE check: ```ruby elsif vulnerability.identifiers.any? { |i| i.external_type == 'cwe' && ['798', '259', '321', '522'].include?(i.external_id) } :secret_detection elsif vulnerability.identifiers.any? { |i| i.external_type == 'cwe' } :sast ``` **Why:** Semgrep/Bandit findings with CWE-798 should map to `:secret_detection`, not `:sast`. --- ## Proposal (Identifier Inference + Tool Name Fallback) ```ruby SECRET_CWES = %w[CWE-798 CWE-259 CWE-321 CWE-522].freeze KNOWN_SECRET_SCANNERS = %w[gitleaks trufflehog].freeze def infer_report_type_from_result(result, scanner_name) identifiers = extract_identifiers(result) # Priority 1: Identifier-based if identifiers.any? { |i| i[:type] == 'cve' } :dependency_scanning elsif has_image_location?(result) :container_scanning elsif identifiers.any? { |i| i[:type] == 'cwe' && SECRET_CWES.include?(i[:id]) } :secret_detection elsif identifiers.any? { |i| i[:type] == 'cwe' } :sast # Priority 2: Tool name fallback elsif KNOWN_SECRET_SCANNERS.any? { |name| scanner_name.downcase.include?(name) } :secret_detection elsif scanner_name =~ /brakeman/i :sast # Priority 3: Default else :sast # SARIF = Static Analysis end end ``` ```ruby def parse!(json_data, report) # Group results by inferred type grouped_results = sarif_results.group_by do |result| infer_report_type_from_result(result, scanner_name) end # Create one Security::Report per type grouped_results.map do |report_type, results| create_report(type: report_type, results: results) end end ``` ## Fallback Type Decision: `:sast` (Not `:generic`) I originally proposed falling back to `:generic` but that's a special type we only use for manual findings. It also have no auto-resolution logic. So given SARIF's specific uses, we should fallback to `:sast` (Not `:generic`). This is still challenging however, so for full coverage (see above) including SARIF ingestion of secrets, we would need to either allowlist specific secret detectors or slurp them up as SAST results. ## Alternative Approach Considered: Reject/Drop Findings Reject SARIF uploads with no identifiers instead of using fallback Pros: - Explicit failure (users know upload didn't work) - No misclassification - Forces tool vendors to emit proper identifiers Cons: - Breaks user workflows (Gitleaks/TruffleHog uploads would fail) - Poor UX (users expect SARIF to "just work") - Doesn't scale (every new tool without identifiers breaks) Decided fallback approach is better for public SARIF ingestion --- ## SARIF Samples Generated https://gitlab.com/theoretick/sarif_survey
issue