Avoid triggering a re-scan when dependencies haven't changed

Why are we doing this work

Currently, SBOM scans are triggered regardless of whether a project's dependencies have actually changed. This leads to unnecessary resource use and complexity. By implementing mechanisms like change detection at the job and analyzer levels, and artifact caching on the monolith side, we can significantly reduce the number of unnecessary scans while maintaining security coverage.

Problem to solve

The current SBOM scanning process:

  • Triggers scans on every pipeline run, regardless of whether dependencies have changed (e.g. whether Gemfile.lock was updated)
  • Processes identical dependencies and generates the same results
  • Wastes computational resources by triggering dependency detection in the analyzer and the scan in the monolith

Proposal

The old proposal listed template-level optimization based on change detection. Adding such a flag is no longer straightforward because the v2 template no longer does existence checks. Analyzer-side dependency change detection would also run into problems: even when all components are the same, a new advisory may have been issued, which would require a re-scan. Serving a cached scanning result is closer to what we can accomplish in the next iteration with the Sbom Scanning API.

In this issue's proposal discussions, the various levels at which scan result re-use is possible have been discussed. They can be visualized using the current implementation's result query steps.

| Step | Location | Work done | Resources used | Effects |
|------|----------|-----------|----------------|---------|
| 1 | analyzer | generate SBOM | job | bounded |
| 2 | analyzer | upload SBOM | API call, network I/O | API rate limits (authorization, upload)<br>Creates a long-lived connection on the instance, taking an available connection out of the HTTP pool<br>Increased chance of job timeout for large SBOMs |
| 3 | instance | save SBOM and trigger scan | network I/O to object storage | Creates a long-lived connection to object storage, potentially taking a connection (limited effect) |
| 4 | analyzer | poll for results | API call | API rate limits |
| 5 | instance | create findings | creates a sidekiq job<br>network I/O to fetch the SBOM<br>memory use to load the SBOM<br>PostgreSQL memory and CPU creating findings<br>network I/O to store results | Occupies a job in the queue<br>Instance resources used<br>PostgreSQL resources used directly and indirectly (table, index locks)<br>Network I/O fetching the SBOM and storing results |
| 6 | analyzer | fetch findings | API call, network I/O to fetch findings | API rate limits |

Scan result re-use can be added at any of the steps listed, but the earlier it is done, the greater the savings in network I/O, open connections to the instance, Rails memory use, PostgreSQL memory and CPU, and traffic on the processing queue.

Allowing the analyzer to send a digest of an SBOM and its purl types before upload puts result re-use at step 1 above and provides the most efficient resource use.

Outline

  1. Analyzer checks whether the list of components has been scanned before by sending a digest of the components
    1. Instance provides a new endpoint: POST /api/v4/jobs/:job_id/sbom_scans/:sbom_hash
    2. Instance tries to find an SbomScan in the project with the provided sbom_hash; when it exists, it creates a copy reusing the results of the original
  2. Analyzer skips upload if a 201 with an sbom_scan_id is returned and goes on to fetch results with that sbom_scan_id (sketched below)
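For illustration, a minimal sketch of the exchange from the analyzer's side. The analyzer itself is written in Go; Ruby is used here only for consistency with the other sketches in this document, and the endpoint shape, token header, and response body are assumptions based on the outline above:

```ruby
require 'net/http'
require 'json'

# Hypothetical pre-upload cache check. Returns an sbom_scan_id on a
# cache hit, nil on a miss.
def check_sbom_cache(base_url, job_id, sbom_digest, job_token)
  uri = URI("#{base_url}/api/v4/jobs/#{job_id}/sbom_scans/#{sbom_digest}")
  request = Net::HTTP::Post.new(uri, 'JOB-TOKEN' => job_token)

  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end

  case response.code.to_i
  when 201 then JSON.parse(response.body)['sbom_scan_id'] # hit: skip upload, fetch results
  when 404 then nil                                       # miss: upload the SBOM as before
  end
end
```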

What does success look like

SBOM scans for the same SBOM in the project are re-used, reducing duplicate processing while maintaining security coverage freshness. The analyzer detects identical SBOMs before upload, the instance serves cached results when safe, infrastructure resource contention and load are reduced, and scan times are shortened.

Outcomes

  • Fewer redundant scans triggered for the same SBOM
  • Fewer redundant uploads for the same SBOM

Functional Validation

  • Analyzer generates a digest that is a true representation of the SBOM elements that would be used in the scan
  • Analyzer calls caching endpoint correctly before starting upload
  • Instance returns 201 with sbom_scan_id when cached result exists and advisories are fresh
  • Instance returns 404 when no cache exists or when new advisories have been published
  • Analyzer skips upload and proceeds to result fetching when receiving 201
  • Instance validates advisory freshness via PackageMetadata::Checkpoint

Data integrity validation

  • The new table and columns are properly indexed and constrained, scoping results to the project
  • Multiple scans can safely reference the same SbomScanResult
  • Orphaned scan results are properly cleaned up; the destroy service removes only the scan, leaving the result in place, when other references to the result exist

Observability validation

  • Metrics emitted for cache hit/miss rate per project
  • Metrics track scan result reuse count and advisory staleness detection
  • SBOM Scan API Dashboard updated to show cache effectiveness across projects

Rollout validation

  • Feature flag to fully control caching behaviour (when the flag is disabled the endpoint always returns 404)
  • Gradual rollout to gitlab-org first then to other projects on GitLab.com
  • Review metrics after each rollout iteration for rise in error rates, timeouts, and reduction in resource contention

Risks

Several risk categories are possible.

  1. Caching: Cache is not being hit and re-scans are occurring for SBOMs with finished scans
  2. DB size: In the old code path every completed dependency scanning job creates an SbomScan record. This should not change with caching, but an incorrect caching flow (e.g. stale records or duplicate copies) may end up creating more records and growing this table unacceptably.
  3. Stale results: If new security advisories are published after a scan completes, serving cached results could miss newly discovered vulnerabilities.
  4. SBOM digest consistency:
    1. Digest collisions could theoretically cause different SBOMs to be treated as identical
    2. Inconsistent digest parametrization (e.g. a purl keeps changing even though the component is the same) could lead to incorrect cache misses
    3. Digest implementation change could create inconsistencies between instance and analyzer (e.g. analyzer uses new version of digest but not instance)
  5. Wrong scan result served: Scan results will be moved to a new model and have a different object storage path. This could lead to invalid (missing) scan results.
  6. Concurrent requests for same digest: Multiple jobs scanning identical SBOMs simultaneously may create a race condition and cause duplicate processing.
| Risk | How to address | What to monitor |
|------|----------------|-----------------|
| Ineffective caching | Ensure digest consistency<br>Test digest reliability for different purl types | Caching ratio |
| DB growth from new records | Ensure the same number of SbomScan records is written as with the old code path | Table growth at rollout<br>Table growth over a longer time period |
| Stale results served | Explicit advisory freshness check via PackageMetadata::Checkpoint<br>UAT: perform a scan, push a new advisory for an SBOM component in the commit, re-trigger the scan<br>Automated tests: create a project and an automated test which simulates the above for new advisories | Invalidation via a dashboard chart showing advisory update events and cache misses |
| SBOM digest consistency | (1) Use a collision-resistant algorithm (SHA-256)<br>(3) Version the digest (e.g. sha256v1-{digest}) so the instance can tell which version of the algorithm was used; the worst case is a cache miss and the enqueueing of a new scan | (2) Caching rates for different kinds of purl types (e.g. some heavily use purl qualifiers and may cause issues)<br>(2) Cache misses |
| Wrong scan result served | Only populate the new result model and paths<br>Do not remove the old code path or columns in the initial rollout, allowing old results to be served until old records expire | Increase in errors on the analyzer side when fetching scan results |
| Concurrent requests for same digest | If the digest is found but the scan is not finished, treat this as a missing result; this triggers a new (duplicate) scan<br>Some duplication will exist, but it is not expected to be large and still represents a significant reduction in the amount of processing currently done | — |

Scope

  • Result reuse will be done only on a project level.
    • Security constraints haven't been evaluated for group and instance levels (and re-use at those levels isn't really practical anyway, because identical SBOMs are much likelier within the same project).
  • Template level change detection and result reuse is out of scope (e.g. caching the security report itself keyed by the supported files present).

Implementation plan

SBOM digest

The digest should uniquely identify the dependencies in an SBOM. This implies a stable ordering and a collision-resistant approach. All DS analyzer SBOMs have a purl attribute which uniquely identifies the component; a sketch follows the list below.

  • use purl attribute
  • lexicographically sorted
  • includes qualifiers
  • if purl is not present, then the sbom cache call should be skipped (TBD)
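A minimal sketch of such a digest, assuming the input is the list of purl strings (with qualifiers) extracted from the SBOM; the helper name and exact canonicalization are illustrative:

```ruby
require 'digest'

# Versioned, order-independent digest over an SBOM's purls.
# Sorting makes the digest stable regardless of component order;
# the "sha256v1-" prefix lets the instance detect algorithm changes.
def sbom_digest(purls)
  canonical = purls.uniq.sort.join("\n")
  "sha256v1-#{Digest::SHA256.hexdigest(canonical)}"
end

sbom_digest(["pkg:gem/rails@7.1.3", "pkg:gem/rack@3.0.9"])
# => "sha256v1-..."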

Advisory DB Freshness

  1. for each purl_type in the list supplied by the client
  2. find the sbom_scan with the given sbom_digest
  3. query the Redis cache for the sequence of that purl_type
  4. if Redis does not have this entry, query the pm_checkpoints table
    1. store the sequence value of checkpoint.purl_type in the Redis cache
  5. if the sequence of any purl_type is newer than sbom_scan.created_at, the sbom_scan is not considered fresh
    1. edge case: if a checkpoint for the given purl_type did not exist, then the sbom_scan is considered fresh
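A sketch of these steps, assuming checkpoint `sequence` values are comparable with the scan's creation time and that Redis sits behind `Rails.cache`; all names are illustrative:

```ruby
# A scan is fresh only if no purl_type has ingested newer advisory
# data since the scan was created; a missing checkpoint counts as fresh.
def advisory_fresh?(sbom_scan, purl_types)
  purl_types.all? do |purl_type|
    sequence = Rails.cache.fetch("pm_checkpoint_sequence:#{purl_type}") do
      PackageMetadata::Checkpoint
        .find_by(data_type: :advisories, purl_type: purl_type)
        &.sequence
    end

    sequence.nil? || sequence <= sbom_scan.created_at.to_i
  end
end
```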

Race conditions for scans

Only scans in finished state are considered. When querying for an sbom_scan matching a subject sbom_digest, any scans in other states (e.g. created, running) will not be considered.

This means that several concurrent requests to scan an sbom will create a redundant scan. Because this concurrent scenario is an edge case and handling of race conditions and result expiry is more complicated, the redundancy is acceptable.
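In query terms this is just a state filter on the lookup; a hypothetical scope:

```ruby
class SbomScan < ApplicationRecord
  # Only finished scans participate in result re-use; scans still in
  # created/running states behave like cache misses, so concurrent
  # requests for the same digest each start their own scan.
  scope :reusable_for, ->(project_id, sbom_digest) do
    where(project_id: project_id, sbom_digest: sbom_digest, status: :finished)
  end
end
```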

Relationship between scans and results

An SbomScanResult can have many SbomScans.
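Since the foreign key (sbom_scan_result_id) lives on the scans table, the Rails-side associations would look roughly like:

```ruby
class SbomScanResult < ApplicationRecord
  has_many :sbom_scans
end

class SbomScan < ApplicationRecord
  # Many scans may point at one shared result record.
  belongs_to :sbom_scan_result, optional: true
end
```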

Migration

Because scans are ephemeral, no data migration is necessary. Both the old and new code paths are supported, and the old one can be removed once the (currently) 2-day TTL window expires. Since this feature is currently only available on GitLab.com, no contingencies are needed for Dedicated instance types.

Feature flag

When disabled, the instance should always serve a 404 from the cache endpoint. The SBOM scan GET endpoint should serve a 410, requiring the client to re-enter the request process.
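A sketch of the endpoint guard (Grape-style pseudocode; the flag name, `job` helper, and service are assumptions):

```ruby
post ':job_id/sbom_scans/:sbom_digest' do
  # With the flag off, every cache check is a miss, forcing the
  # analyzer back onto the plain upload path.
  not_found! unless Feature.enabled?(:sbom_scan_result_caching, job.project)

  scan = SbomScanResultCachingService.new(
    project: job.project,
    sbom_digest: params[:sbom_digest],
    purl_types: params[:purl_types]
  ).execute
  not_found! unless scan

  status :created
  { sbom_scan_id: scan.id }
end
```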

Plan

Update SBOM scan processing to add a check on whether an identical SBOM has been scanned, using an SBOM hash as the identifier for the scan result.

  • Analyzer
    • vulnerability module (before upload, for each sbom)
      • generate a digest of an sbom's components
      • call GitLab instance's scan result endpoint with the sbom_digest and purl_types present in sbom
      • if response is a 201, skip upload and go to fetching scan result with returned sbom_scan_id
      • emit observability event when cache found or not found
  • GitLab instance
    • API
      • add endpoint for POST /api/v4/jobs/:job_id/sbom_scans/:sbom_digest
        • call SbomScanResultCachingService with sbom_digest and purl_types
        • if sbom_scan returned respond with 201 and the sbom_scan_id
        • if feature flag is disabled always return 404
    • Services
      • add SbomScanResultCachingService (a consolidated sketch appears after this plan)
        • look up SbomScan.where(digest: digest, state: finished)
        • when found check whether there is fresher advisory data
          • check advisory freshness by fetching the relevant checkpoints: PackageMetadata::Checkpoint.where(data_type: :advisories, purl_type: purl_types)
            • for each checkpoint check advisory data freshness
              • return nil if new advisories have been ingested since the scan finished: checkpoint.sequence > sbom_scan.created_at
            • return sbom scan record
        • create SbomScan copy with the following parameters
          • new_scan = SbomScan.create(digest: found_scan.digest, result: found_scan.result)
        • return new_scan
      • Update ProcessSbomScanService
        • create new record to capture uploaded scan result (replace existing)
        • result = SbomScanResult.create(file: result_file, project: sbom_scan.project)
        • add result to sbom scan and set state to finished: sbom_scan.update(result: result, state: :finished)
      • update DestroySbomScanService
        • destroy sbom scans as before but do not remove result unless it's orphaned
        • results = sbom_scans.map { |scan| scan.result }
        • results.each { |result| SbomScan.where(result_id: result.id).none? && result.delete }
    • Data layer
      • add migration (DDL only)
        • add sbom_vulnerability_scan_results table
          • columns: project_id, file_store, file
        • update sbom_vulnerability_scans table
          • add column sbom_digest
          • add column sbom_scan_result_id
          • add index on project_id, sbom_digest
      • add SbomScanResult model
        • mount_uploader on file attribute on SbomScanUploader
        • add delete_from_storage method for cleaning up uploads
        • override hashed_path method in uploader to store project_id and model_id
      • update SbomScan model
        • add belongs_to relationship (not has_one) for SbomScanResult, since the sbom_scan_result_id FK lives on sbom_vulnerability_scans
        • add scope for fetching by sbom_digest
        • params: project_id, sbom_digest
        • filter: status == finished
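To make the data-layer step concrete, a hypothetical DDL-only migration; the table and column names follow the plan above, while the migration version and helpers are illustrative:

```ruby
class AddSbomScanResultCaching < Gitlab::Database::Migration[2.2]
  def change
    create_table :sbom_vulnerability_scan_results do |t|
      t.bigint :project_id, null: false
      t.integer :file_store
      t.text :file
      t.timestamps_with_timezone
    end

    add_column :sbom_vulnerability_scans, :sbom_digest, :text
    add_column :sbom_vulnerability_scans, :sbom_scan_result_id, :bigint
    add_index :sbom_vulnerability_scans, [:project_id, :sbom_digest]
  end
end
```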

Only one issue is needed, but the work can be broken up into several MRs, roughly following the main points in the implementation plan above.
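Pulling the service pieces of the plan together, a consolidated sketch (all names are assumptions; reusable_for and advisory_fresh? are the helpers sketched in earlier sections):

```ruby
class SbomScanResultCachingService
  def initialize(project:, sbom_digest:, purl_types:)
    @project = project
    @sbom_digest = sbom_digest
    @purl_types = purl_types
  end

  def execute
    found = SbomScan.reusable_for(@project.id, @sbom_digest).first
    return unless found
    return unless advisory_fresh?(found, @purl_types) # see Advisory DB Freshness

    # Copy the scan, pointing it at the already-stored result.
    SbomScan.create!(
      project: @project,
      sbom_digest: found.sbom_digest,
      sbom_scan_result: found.sbom_scan_result,
      status: :finished
    )
  end
end
```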

Sequence of messages with caching

sequenceDiagram
    participant analyzer as Dependency Scanning analyzer
    participant instance as GitLab Rails Backend
    participant database as SbomScan model

    analyzer->>+instance: POST /api/v4/jobs/:job_id/sbom_scans/:sbom_digest
    instance->>database: SbomScan.for_sbom_digest(sbom_digest)
    alt sbom_digest exists
        database-->>instance: scan1
        instance->>instance: scan2 = SbomScan.create(result: scan1.result)
        instance-->>-analyzer: status=201, response={ "download_url": "scan2.download_url"}
        analyzer-->>instance: GET response["download_url"]
    else sbom_digest does not exist
        Note over instance: follows the existing path of SBOM upload and enqueued background processing
        database-->>instance: nil
        instance-->>analyzer: status=404
        analyzer-->>instance: upload SBOM
    end

Observability events

Instance

  1. Cache Check Endpoint (POST /api/v4/jobs/:job_id/sbom_scans/:sbom_digest)
    • Cache hit event
      • by project, purl type
    • Cache miss event
      • by project, purl type
    • Response time
  2. Sbom Scan
  3. SbomScanResultCachingService
    1. Digest lookup time
    2. Stale result found (e.g. result exists, but advisories updated)
    3. Fresh result found (e.g. result exists, and no new advisories)
  4. ProcessSbomScanService
    1. Ensure result upload events
  5. DestroySbomScanService
    1. sbom scan deleted, result orphaned (bool)
    2. file cleaned up
  6. Other?
    1. API rate limit on endpoint
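A hedged sketch of instance-side emission using the internal events framework; the event name and properties are assumptions, not a final schema:

```ruby
# Emit one event per cache check so hit/miss rates can be charted
# per project; purl types ride along as a property.
Gitlab::InternalEvents.track_event(
  'check_sbom_scan_cache', # hypothetical event name
  project: project,
  additional_properties: {
    label: cache_hit ? 'hit' : 'miss',
    property: purl_types.join(',')
  }
)
```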

Analyzer

  1. Cache hit/miss event
  2. Errors from digest generation
  3. Errors from cache endpoint

Rollout plan

Because SbomScan records expire after 2 days, it is simpler to support both storage locations for at least one full expiry window after turning the feature flag on globally, and then remove the legacy code path. This is in contrast to a migration, which would involve moving files in object storage (or instance storage) along with DB migrations.

There are 2 storage locations:

  • Current: SbomScan.result_file is an attribute on the SbomScan model mounted with carrierwave
  • New code path: SbomScan.result is an active record association to SbomScanResult which has SbomScanResult.result_file attribute mounted with carrierwave

The rollout without data migration involves selectively serving the SBOM scan results from either location, and then, after globally rolling out the new code path (and waiting at least the minimum expiry period for SbomScan), removing the conditional.

The rollout would look like:

  • Stage 1: Deploy new code supporting dual read (SbomScan.result_file attribute and SbomScan.result association) but writing only to SbomScan.result_file attribute.
  • Stage 2a (enable feature flag for testing): Continue dual reads. SBOM scan result processing for projects with the FF enabled now writes the SbomScan.result association, while others write SbomScan.result_file.
  • Stage 2b (enable feature flag globally): Continue dual reads. SBOM scan result processing writes only SbomScan.result.
  • Stage 3 (after the 2-day SbomScan TTL or next milestone): Update code to read only from SbomScan.result.
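The dual read in stages 1–2b can be as small as a single accessor; a sketch using the attribute and association names listed above:

```ruby
# Prefer the new association when present; fall back to the legacy
# mounted attribute. After Stage 3 only the first branch remains.
def scan_result_file(sbom_scan)
  sbom_scan.result&.result_file || sbom_scan.result_file
end
```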
stateDiagram-v2
    direction LR
    [*] --> Stage1
    Stage1: 18.6 - Original code path (without caching)

    state Stage1 {
        direction LR
        [*] --> upload1
        upload1: Upload SBOM artifact

        upload1 --> process1
        process1: Process SBOM scan

        process1 --> store1
        store1: Store result in SbomScan.result_file

        store1 --> api_return1
        api_return1: API response - Use SbomScan.result_file
    }

    [*] --> Stage2a
    Stage2a: 18.7 (ff on selectively) - Write SbomScan.result_file or SbomScan.result while Reading either

    state Stage2a {
        direction LR
        [*] --> upload2a
        upload2a: Upload SBOM artifact

        upload2a --> process2a
        process2a: Process SBOM scan

        process2a --> store2a_ff_off: Feature flag OFF
        process2a --> store2a_ff_on: Feature flag ON

        store2a_ff_off: Store result in SbomScan.result_file
        store2a_ff_on: Store result in SbomScan.result

        store2a_ff_off --> api_return2a
        store2a_ff_on --> api_return2a

        api_return2a: API response

        api_return2a --> api_return_attr2a: Feature flag OFF
        api_return_attr2a: API response - Use SbomScan.result_file

        api_return2a --> api_return_assoc2a: Feature flag ON
        api_return_assoc2a: API response - Use SbomScan.result
    }

    [*] --> Stage2
    Stage2: Stage 2b (18.7 ff on globally) – Write only SbomScan.result while Reading either SbomScan.result_file or SbomScan.result

    state Stage2 {
        direction LR
        [*] --> upload2
        upload2: Upload SBOM artifact

        upload2 --> scan2
        scan2: Process SBOM scan

        scan2 --> api_return2
        api_return2: API response

        api_return2 --> api_return_attr2: Feature flag OFF
        api_return_attr2: API response - Use SbomScan.result_file

        api_return2 --> api_return_assoc2: Feature flag ON
        api_return_assoc2: API response - Use SbomScanResult.file
    }

    [*] --> Stage3
    Stage3: Stage 3 (18.8?) – New code path only

    state Stage3 {
        direction LR
        [*] --> upload3
        upload3: Upload SBOM artifact

        upload3 --> scan3
        scan3: Process SBOM scan

        scan3 --> api_return3
        api_return3: API response - Use SbomScanResult.file
    }

Testing plan

Unit testing

Instance

  • Unit tests for all new and modified functionality

Analyzer

  • Unit tests for all new and modified functionality
  • Unit tests for stability of digests of existing (and possibly additional) test SBOMs for each purl type

Integration testing

Analyzer

  • Add integration tests for the cache miss/hit result loop with mocked api server

User Acceptance testing

Instance

  • Test cache miss/hit scenario
    • Verify pipeline security tab
    • Verify vulnerability report
  • Test with supported purl types
