Add multi-scan support for mixed-type security report ingestion (SARIF multi-run support)
What does this MR do and why?
See https://gitlab.com/gitlab-org/gitlab/-/work_items/595060#note_3270378278 for context (and original discussion)
SARIF is a static analysis interchange format designed to be tool-neutral. A single SARIF file may contain multiple runs[] entries (one per tool), and each run may itself contain findings of heterogeneous security types (SAST, dependency scanning, secret detection, etc.). This MR makes the ingestion pipeline handle both cases correctly.
Problem
Previously all SARIF findings were collapsed into a single Security::Report with scan_type = :sarif. This was not ideal for two reasons:
- Multi-run files: a SARIF file with a separate
runfor Semgrep and for ESLint produced one scan with an ambiguous dual-scanner identity. - Mixed-type findings: a single
runcontaining both CVE identifiers (dependency findings) and CWE identifiers (SAST findings) produced a single:sarif-typed scan, making security policies, UUID generation, and auto-resolution unreliable since they all key onscan_type.
Solution
One Security::Report (and therefore one Security::Scan DB row) per SARIF run[]. The unique security_scans index is extended with scanner_external_id so N scans of the same scan_type from the same build are permitted when they come from distinct tools.
security_scans.scanner_external_id serves two purposes:
- Write-side uniqueness: Distinguishes multiple
Security::Scanrows for the same(build_id, scan_type)when a single SARIF artifact contains multiple tool runs. Without it, the unique index would reject the second scan. - Read-side report selection: When
Security::Scan#security_reportis called, it usesscanner_external_idto pick the correct sub-report from the re-parsed artifact.
It does NOT:
- Serve as a foreign key to
vulnerability_scanners - Get exposed in any API, GraphQL type, or UI
- Participate in any query filter (the
scanner_external_idfilters onVulnerability/VulnerabilityReadare forvulnerability_scanners.external_id, a different column on a different table)
Key changes:
- DB migration: adds
scanner_external_idtosecurity_scans, replaces unique index to allow N scans per(build_id, scan_type)keyed byCOALESCE(scanner_external_id, '') Security::Report(in-memory): addsscanner_external_idattribute so parsers can tag each report with the tool identity before it reachesStoreScanServiceEE::Ci::JobArtifact: renamessecurity_reporttosecurity_reports(returnsArray<Security::Report>); keepssecurity_reportas a.firstshim forScanArtifactComparatorandSecurity::ScancallersStoreGroupedScansService: fans outStoreScanService.executeonce per report for multi-run artifactsStoreScanService: accepts optionalreport:keyword so callers can pass each report from a multi-run artifact individually; includesscanner_external_idinfind_or_initialize_byto match the COALESCE-keyed index
DB Changes
security_report.scanner.external_id is the tool name (e.g. "semgrep", "codeql"). We store it on security_scans to distinguish between different tools in a multi-run artifact.
For single-scanner artifacts (legacy behavior) there is only one scanner per artifact, so scanner_external_id is nil and the COALESCE in the index treats it identically to the old empty-slot behavior. No backfill needed.
Alternatively we could always populate scanner_external_id and backfill existing rows, but the gain in data integrity doesn't justify backfilling millions of rows for what is semantically a tie-breaker column.
Before (security_scans unique constraint):
| build_id | scan_type | scanner_external_id | Constraint |
|---|---|---|---|
| 123 | sast | NULL | Unique (123, sast) |
| 123 | dast | NULL | Unique (123, dast) |
| 123 | sarif | NULL | Unique (123, sarif) |
After (security_scans unique constraint):
Note: sarif scan_type here is descriptive but will not actually be used in the final rollout, see What this MR does NOT do:
| build_id | scan_type | scanner_external_id | Constraint |
|---|---|---|---|
| 123 | sast | NULL | Unique (123, sast, '') |
| 123 | dast | NULL | Unique (123, dast, '') |
| 123 | sarif | "semgrep" | Unique (123, sarif, semgrep) |
| 123 | sarif | "eslint" | Unique (123, sarif, eslint) |
| 123 | sarif | "codeql" | Unique (123, sarif, codeql) |
Queries
Before
From ee/app/services/security/store_scan_service.rb
https://console.postgres.ai/gitlab/gitlab-production-sec/sessions/51116/commands/151332
SELECT "security_scans".*
FROM "security_scans"
WHERE "security_scans"."build_id" = $1
AND "security_scans"."scan_type" = $2
LIMIT 1;After
legacy single-report-per-artifact case
SELECT "security_scans".*
FROM "security_scans"
WHERE "security_scans"."build_id" = $1
AND "security_scans"."scan_type" = $2
AND "security_scans"."scanner_external_id" IS NULL
LIMIT 1;multi-report-per-artifact case
SELECT "security_scans".*
FROM "security_scans"
WHERE "security_scans"."build_id" = $1
AND "security_scans"."scan_type" = $2
AND "security_scans"."scanner_external_id" = $3
LIMIT 1;What this MR does NOT do
Type inference (inferring :sast vs :dependency_scanning vs :secret_detection from CVE/CWE identifiers within a single run) is out of scope here. That requires a follow-up MR that adds identifier-based inference logic to the SARIF parser's process_run method. This MR provides the infrastructure (schema + fan-out plumbing) that follow-up work builds on.
MR acceptance checklist
- I have evaluated the MR acceptance checklist for this MR.