Add multi-scan support for mixed-type security report ingestion (SARIF multi-run support)

What does this MR do and why?

See https://gitlab.com/gitlab-org/gitlab/-/work_items/595060#note_3270378278 for context (and original discussion)

SARIF is a static analysis interchange format designed to be tool-neutral. A single SARIF file may contain multiple runs[] entries (one per tool), and each run may itself contain findings of heterogeneous security types (SAST, dependency scanning, secret detection, etc.). This MR makes the ingestion pipeline handle both cases correctly.

Problem

Previously all SARIF findings were collapsed into a single Security::Report with scan_type = :sarif. This was not ideal for two reasons:

  1. Multi-run files: a SARIF file with a separate run for Semgrep and for ESLint produced one scan with an ambiguous dual-scanner identity.
  2. Mixed-type findings: a single run containing both CVE identifiers (dependency findings) and CWE identifiers (SAST findings) produced a single :sarif-typed scan, making security policies, UUID generation, and auto-resolution unreliable since they all key on scan_type.

Solution

One Security::Report (and therefore one Security::Scan DB row) per SARIF run[]. The unique security_scans index is extended with scanner_external_id so N scans of the same scan_type from the same build are permitted when they come from distinct tools.

security_scans.scanner_external_id serves two purposes:

  1. Write-side uniqueness: Distinguishes multiple Security::Scan rows for the same (build_id, scan_type) when a single SARIF artifact contains multiple tool runs. Without it, the unique index would reject the second scan.
  2. Read-side report selection: When Security::Scan#security_report is called, it uses scanner_external_id to pick the correct sub-report from the re-parsed artifact.

It does NOT:

  • Serve as a foreign key to vulnerability_scanners
  • Get exposed in any API, GraphQL type, or UI
  • Participate in any query filter (the scanner_external_id filters on Vulnerability/VulnerabilityRead are for vulnerability_scanners.external_id, a different column on a different table)

Key changes:

  • DB migration: adds scanner_external_id to security_scans, replaces unique index to allow N scans per (build_id, scan_type) keyed by COALESCE(scanner_external_id, '')
  • Security::Report (in-memory): adds scanner_external_id attribute so parsers can tag each report with the tool identity before it reaches StoreScanService
  • EE::Ci::JobArtifact: renames security_report to security_reports (returns Array<Security::Report>); keeps security_report as a .first shim for ScanArtifactComparator and Security::Scan callers
  • StoreGroupedScansService: fans out StoreScanService.execute once per report for multi-run artifacts
  • StoreScanService: accepts optional report: keyword so callers can pass each report from a multi-run artifact individually; includes scanner_external_id in find_or_initialize_by to match the COALESCE-keyed index

DB Changes

security_report.scanner.external_id is the tool name (e.g. "semgrep", "codeql"). We store it on security_scans to distinguish between different tools in a multi-run artifact.

For single-scanner artifacts (legacy behavior) there is only one scanner per artifact, so scanner_external_id is nil and the COALESCE in the index treats it identically to the old empty-slot behavior. No backfill needed.

Alternatively we could always populate scanner_external_id and backfill existing rows, but the gain in data integrity doesn't justify backfilling millions of rows for what is semantically a tie-breaker column.

Before (security_scans unique constraint):

build_id scan_type scanner_external_id Constraint
123 sast NULL Unique (123, sast)
123 dast NULL Unique (123, dast)
123 sarif NULL Unique (123, sarif)

After (security_scans unique constraint):

Note: sarif scan_type here is descriptive but will not actually be used in the final rollout, see What this MR does NOT do:

build_id scan_type scanner_external_id Constraint
123 sast NULL Unique (123, sast, '')
123 dast NULL Unique (123, dast, '')
123 sarif "semgrep" Unique (123, sarif, semgrep)
123 sarif "eslint" Unique (123, sarif, eslint)
123 sarif "codeql" Unique (123, sarif, codeql)

Queries

Before

From ee/app/services/security/store_scan_service.rb

https://console.postgres.ai/gitlab/gitlab-production-sec/sessions/51116/commands/151332

SELECT "security_scans".*
FROM "security_scans"
WHERE "security_scans"."build_id" = $1
  AND "security_scans"."scan_type" = $2
LIMIT 1;

After

legacy single-report-per-artifact case

SELECT "security_scans".*
FROM "security_scans"
WHERE "security_scans"."build_id" = $1
  AND "security_scans"."scan_type" = $2
  AND "security_scans"."scanner_external_id" IS NULL
LIMIT 1;

multi-report-per-artifact case

SELECT "security_scans".*
FROM "security_scans"
WHERE "security_scans"."build_id" = $1
  AND "security_scans"."scan_type" = $2
  AND "security_scans"."scanner_external_id" = $3
LIMIT 1;

What this MR does NOT do

Type inference (inferring :sast vs :dependency_scanning vs :secret_detection from CVE/CWE identifiers within a single run) is out of scope here. That requires a follow-up MR that adds identifier-based inference logic to the SARIF parser's process_run method. This MR provides the infrastructure (schema + fan-out plumbing) that follow-up work builds on.

MR acceptance checklist

Edited by Lucas Charles

Merge request reports

Loading