Avoid triggering a re-scan when dependencies haven't changed
Why are we doing this work
Currently, SBOM scans are triggered regardless of whether a project's dependencies have actually changed. This leads to unnecessary resource use and complexity. By implementing mechanisms like change detection at the job and analyzer levels, and artifact caching on the monolith side, we can significantly reduce the number of unnecessary scans while maintaining security coverage.
Relevant links
- Implement soft rate limiting for SBOM Scan Proc... (#561759 - closed) • Olivier Gonzalez • 18.4
- Dependencies persist when supported files are r... (#560331) • Unassigned • 18.7
Problem to solve
The current SBOM scanning process:
- Triggers scans on every pipeline run, regardless of dependency changes (e.g. whether `Gemfile.lock` was updated)
- Processes identical dependencies and generates the same results
- Wastes computational resources by triggering dependency detection in the analyzer and the scan in the monolith
Proposal
The old proposal listed a template-level optimization based on change detection. Adding such a flag is no longer straightforward because the v2 template no longer performs existence checks. Analyzer-side dependency change detection would also run into problems: even when all components are identical, a new advisory may have been issued, which would require a re-scan. Serving a cached scanning result is closer to what we can accomplish in the next iteration with the SBOM Scanning API.
This issue's proposal discussions covered the various levels at which scan result re-use is possible. These levels can be visualized using the current implementation's result query steps.
| step | location | work done | resources used | effects |
|---|---|---|---|---|
| 1 | analyzer | generate sbom | job bounded | |
| 2 | analyzer | upload sbom | api call, network I/O | |
| 3 | instance | save sbom and trigger scan | network I/O to object storage | |
| 4 | analyzer | poll for results | api call | |
| 5 | instance | create findings | | |
| 6 | analyzer | fetch findings | | |
Scan result re-use can be added at any of the steps listed, but the earlier it is done, the more resource savings are realized: network I/O, open connections to the instance, Rails memory use, PostgreSQL memory and CPU use, and traffic on the processing queue.
Allowing the analyzer to send a digest of an SBOM and its purl types before upload puts result re-use at step 1 above and provides the most efficient resource use.
Outline
- Analyzer checks if list of components has been scanned before by sending a digest of components
- Instance provides a new endpoint, `POST /api/v4/jobs/:job_id/sbom_scans/:sbom_hash`: the instance tries to find an `SbomScan` in the project with the provided `sbom_hash`; when one exists, it creates a copy reusing the results from the original
- Analyzer skips the upload if an `sbom_scan_id` is returned with a `201` and goes on to fetch results with that `sbom_scan_id`
What does success look like
SBOM scans for the same SBOM in a project are re-used, reducing duplicate processing while maintaining security coverage freshness. The analyzer detects identical SBOMs before upload, the instance serves cached results when it is safe to do so, infrastructure resource contention and load are reduced, and scan times are shortened.
Outcomes
- Fewer redundant scans triggered for the same SBOM
- Fewer redundant uploads for the same SBOM
Functional Validation
- Analyzer generates a digest which is a true representation of the SBOM elements used in the scan
- Analyzer calls the caching endpoint correctly before starting the upload
- Instance returns `201` with `sbom_scan_id` when a cached result exists and advisories are fresh
- Instance returns `404` when no cache exists or when new advisories have been published
- Analyzer skips the upload and proceeds to result fetching when receiving a `201`
- Instance validates advisory freshness via `PackageMetadata::Checkpoint`
Data integrity validation
- New table and columns are properly indexed and constrained to `project` results
- Multiple scans can safely reference the same `SbomScanResult`
- Orphaned scan results are properly cleaned up; the destroy service removes only the scan when other references to the result exist
Observability validation
- Metrics emitted for cache hit/miss rate per project
- Metrics track scan result reuse count and advisory staleness detection
- SBOM Scan API Dashboard updated to show cache effectiveness across projects
Rollout validation
- Feature flag to fully control caching behaviour (when the flag is disabled the endpoint always returns a `404`)
- Gradual rollout to `gitlab-org` first, then to other projects on GitLab.com
- Review metrics after each rollout iteration for rises in error rates and timeouts, and for reductions in resource contention
Risks
Several risk categories are possible.
- Caching: Cache is not being hit and re-scans are occurring for SBOMs with finished scans
- DB size: In the old code path, every completed dependency scanning job creates an `SbomScan` record. This should not change with caching, but an incorrect caching flow (e.g. stale records) may end up creating more records and growing this table unacceptably.
- Stale results: If new security advisories are published after a scan completes, serving cached results could miss newly discovered vulnerabilities.
- SBOM digest consistency:
- Digest collisions could theoretically cause different SBOMs to be treated as identical
- Inconsistent digest parametrization (e.g. the `purl` keeps changing even though the component is the same) could lead to incorrect cache misses
- Digest implementation changes could create inconsistencies between instance and analyzer (e.g. the analyzer uses a new version of the digest but the instance does not)
- Wrong scan result served: Scan results will be moved to a new model and have a different object storage path. This could lead to invalid (missing) scan results.
- Concurrent requests for same digest: Multiple jobs scanning identical SBOMs simultaneously may create race condition and cause duplicate processing.
| Risk | How to address | What to monitor |
|---|---|---|
| Ineffective caching | | |
| DB growth from new records | | |
| Stale results served | | |
| SBOM digest consistency | | |
| Wrong scan result served | | |
| Concurrent requests for same digest | | |
Scope
- Result reuse will be done only on a project level.
- Security constraints haven't been evaluated for group and instance levels (and those levels are less valuable anyway, because identical SBOMs are far more likely within the same project).
- Template level change detection and result reuse is out of scope (e.g. caching the security report itself keyed by the supported files present).
Implementation plan
SBOM digest
The digest should uniquely identify the dependencies in an SBOM. This implies a stable, ordered, and collision-resistant approach. All DS analyzer SBOMs have a purl attribute which uniquely identifies the component.
- use the `purl` attribute
  - lexicographically sorted
  - includes qualifiers
- if `purl` is not present, then the sbom cache call should be skipped (TBD)
Advisory DB Freshness
- find the `sbom_scan` with the given `sbom_digest`
- for each `purl_type` in the list supplied by the client
  - query the Redis cache for the `sequence` of that `purl_type`
  - if Redis does not have this entry, query the `pm_checkpoints` table
    - store the `sequence` value of `checkpoint.purl_type` in the Redis cache
- if the `sequence` of any `purl_type` is newer than `sbom_scan.created_at`, the `sbom_scan` is not considered fresh
  - edge case: if the checkpoint for a given `purl_type` does not exist, then the `sbom_scan` is considered fresh
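The freshness rule above can be sketched like this. It is a hypothetical, dependency-free sketch: the `cache` hash stands in for Redis, `checkpoint_table` for the `pm_checkpoints` table, and `sequence` is modeled as a `Time` so it is comparable with the scan's `created_at`.

```ruby
# A scan is fresh only if no supplied purl_type has a checkpoint sequence
# newer than the scan's creation time.
def scan_fresh?(scan_created_at, purl_types, cache, checkpoint_table)
  purl_types.all? do |purl_type|
    # Look up the sequence in the cache first; on a miss, read the checkpoint
    # table and store the value back in the cache.
    sequence = cache[purl_type] ||= checkpoint_table[purl_type]
    # Edge case: no checkpoint for this purl_type => the scan counts as fresh.
    sequence.nil? || sequence <= scan_created_at
  end
end
```

A scan becomes stale as soon as any one of the supplied purl types has newer advisory data, which matches the "any `purl_type` is newer" rule above.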
Race conditions for scans
Only scans in the finished state are considered. When querying for an `sbom_scan` matching a subject `sbom_digest`, scans in other states (e.g. created, running) are not considered.
This means that several concurrent requests to scan the same SBOM will each create a redundant scan. Because this concurrent scenario is an edge case, and handling race conditions and result expiry is more complicated, the redundancy is acceptable.
Relationship between scans and results
An `SbomScanResult` can have many `SbomScan`s.
Migration
Because scans are ephemeral, no data migration is necessary. The old and new code paths are both supported, and the old path can be removed once the (currently) 2-day TTL window expires. Since this feature is currently only available on GitLab.com, no contingencies are needed for Dedicated or other instance types.
Feature flag
When disabled, the instance should always serve a `404` from the cache endpoint. The SBOM scan GET endpoint should serve a `410`, requiring the client to re-enter the request process.
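The flag-off contract can be sketched as two small status functions. Method names are illustrative, and the `200`/`404` branches for the enabled case are assumptions; only the `404`-on-cache-check and `410`-on-GET behaviours come from the plan above.

```ruby
# Cache check endpoint: with the flag off, always behave as a cache miss.
def cache_check_status(feature_enabled, cached_scan)
  return 404 unless feature_enabled
  cached_scan ? 201 : 404
end

# Scan result GET endpoint: with the flag off, serve 410 Gone so the client
# re-enters the request process from the beginning.
def scan_result_get_status(feature_enabled, scan_exists)
  return 410 unless feature_enabled
  scan_exists ? 200 : 404 # assumed happy-path statuses when enabled
end
```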
Plan
Update SBOM scan processing to add a check on whether an identical SBOM has been scanned, using an SBOM digest as the identifier for the scan result.
- Analyzer
  - `vulnerability` module (before upload, for each SBOM)
    - generate a digest of the SBOM's components
    - call the GitLab instance's scan result endpoint with the `sbom_digest` and the `purl_types` present in the SBOM
    - if the response is a `201`, skip the upload and go on to fetch the scan result with the returned `sbom_scan_id`
    - emit an observability event on cache hit or miss
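The analyzer-side flow could look like the following Ruby sketch (the real analyzer is written in Go). The endpoint path follows the plan; the request body shape and helper names are assumptions.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical analyzer-side cache check before upload.
def check_scan_cache(base_url, job_id, sbom_digest, purl_types)
  uri = URI("#{base_url}/api/v4/jobs/#{job_id}/sbom_scans/#{sbom_digest}")
  response = Net::HTTP.post(uri, { purl_types: purl_types }.to_json,
                            'Content-Type' => 'application/json')
  interpret_cache_response(response.code.to_i, response.body)
end

# 201 => cache hit: return sbom_scan_id and skip the upload.
# 404 => cache miss (or feature flag off): fall through to the upload path.
def interpret_cache_response(status, body)
  case status
  when 201 then JSON.parse(body)['sbom_scan_id']
  when 404 then nil
  else raise "unexpected cache endpoint status: #{status}"
  end
end
```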
- GitLab instance
  - API
    - add endpoint for `POST /api/v4/jobs/:job_id/sbom_scans/:sbom_digest`
      - call `SbomScanResultCachingService` with `sbom_digest` and `purl_types`
      - if an `sbom_scan` is returned, respond with `201` and the `sbom_scan_id`
      - if the feature flag is disabled, always return `404`
  - Services
    - add `SbomScanResultCachingService`
      - look up `SbomScan.where(digest: digest, state: finished)`
      - when found, check whether there is fresher advisory data
        - check advisory freshness by fetching the relevant checkpoints: `PackageMetadata::Checkpoint.where(data_type: :advisories, purl_type: purl_types)`
        - for each checkpoint, check advisory data freshness
          - return `nil` if new advisories have been ingested since the scan finished: `checkpoint.sequence > sbom_scan.created_at`
      - create an `SbomScan` copy with the following parameters
        - `new_scan = SbomScan.create(digest: found_scan.digest, result: found_scan.result)`
      - return `new_scan`
    - update `ProcessSbomScanService`
      - create a new record to capture the uploaded scan result (replacing the existing one): `result = SbomScanResult.create(file: result_file, project: sbom_scan.project)`
      - add the result to the sbom scan and set its state to finished: `sbom_scan.update(result: result, state: :finished)`
    - update `DestroySbomScanService`
      - destroy sbom scans as before, but do not remove a result unless it is orphaned
        - `results = sbom_scans.map { |scan| scan.result }`
        - `results.each { |result| result.destroy if SbomScan.where(result_id: result.id).none? }`
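The caching service's lookup-check-copy flow can be sketched without ActiveRecord. Scans and checkpoints are plain hashes standing in for the models, and `sequence` is modeled as a `Time` comparable with `created_at`; names mirror the plan but the implementation is illustrative.

```ruby
# Hypothetical sketch of the SbomScanResultCachingService flow: find a
# finished scan by digest, verify advisory freshness, return a copy that
# reuses the original result.
def cached_scan_copy(scans, checkpoints, digest:, purl_types:)
  # Only finished scans are eligible for reuse (see race conditions above).
  found = scans.find { |s| s[:digest] == digest && s[:state] == :finished }
  return nil unless found

  # Return nil if any advisories were ingested after the scan finished.
  stale = purl_types.any? do |purl_type|
    checkpoint = checkpoints[purl_type]
    checkpoint && checkpoint[:sequence] > found[:created_at]
  end
  return nil if stale

  # Copy the scan, reusing the original result: no new upload or scan work.
  { digest: found[:digest], result: found[:result], state: :created }
end
```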
  - Data layer
    - add migration (DDL only)
      - add `sbom_vulnerability_scan_results` table
        - columns: `project_id`, `file_store`, `file`
      - update `sbom_vulnerability_scans` table
        - add column `sbom_digest`
        - add column `sbom_scan_result_id`
        - add index on `project_id, sbom_digest`
    - add `SbomScanResult` model
      - `mount_uploader` on the `file` attribute with `SbomScanUploader`
      - add a `delete_from_storage` method for cleaning up uploads
      - override the `hashed_path` method in the uploader to store by `project_id` and `model_id`
    - update `SbomScan` model
      - add a `has_one` relationship for `SbomScanResult`
      - add a scope for fetching by `sbom_digest`
        - params: `project_id`, `sbom_digest`
        - filter: `status == finished`
Only one issue is needed, but the work can be broken up into several MRs, roughly following the main points in the implementation plan above.
Sequence of messages with caching
```mermaid
sequenceDiagram
    participant analyzer as Dependency Scanning analyzer
    participant instance as GitLab Rails Backend
    participant database as SbomScan model
    analyzer->>+instance: POST /api/v4/jobs/:job_id/sbom_scans/:sbom_digest
    instance->>database: SbomScan.for_sbom_digest(sbom_digest)
    alt sbom_digest exists
        database-->>instance: scan1
        instance->>instance: scan2 = SbomScan.create(result: scan1.result)
        instance-->>-analyzer: status=201, response={ "download_url": "scan2.download_url" }
        analyzer-->>instance: GET response["download_url"]
    else sbom_digest does not exist
        Note over instance: follows existing path of sbom upload and enqueuing background processing
        database-->>instance: nil
        instance-->>analyzer: status=404
        analyzer-->>instance: upload SBOM
    end
```
Observability events
Instance
- Cache Check Endpoint (`POST /api/v4/jobs/:job_id/sbom_scans/:sbom_digest`)
  - Cache hit event
    - by project, purl type
  - Cache miss event
    - by project, purl type
  - Response time
- Sbom Scan
  - SbomScanResultCachingService
    - Digest lookup time
    - Stale result found (e.g. result exists, but advisories updated)
    - Fresh result found (e.g. result exists, and no new advisories)
  - ProcessSbomScanService
    - Ensure result upload events
  - DestroySbomScanService
    - sbom scan deleted, result orphaned (bool)
    - file cleaned up
- Other
  - API rate limit on the endpoint
Analyzer
- Cache hit/miss event
- Errors from digest generation
- Errors from cache endpoint
Rollout plan
Because `SbomScan` records expire after 2 days, it is simpler to support both storage locations for at least one full expiry window after turning the feature flag on globally, and then remove the legacy code path. This is in contrast to a migration, which would involve moving files in object storage (or instance storage) along with DB migrations.
There are 2 storage locations:
- Current: `SbomScan.result_file` is an attribute on the `SbomScan` model mounted with `carrierwave`
- New code path: `SbomScan.result` is an Active Record association to `SbomScanResult`, which has a `SbomScanResult.result_file` attribute mounted with `carrierwave`
The rollout without a data migration involves selectively serving the SBOM scan results from either location, and then, after globally rolling out the new code path (and waiting at least the minimum expiry period for `SbomScan`), removing the conditional.
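The dual-read conditional can be sketched as follows; scans are plain hashes standing in for the `SbomScan` model, and the helper name is illustrative.

```ruby
# Hypothetical dual-read used during rollout: prefer the new association,
# fall back to the legacy attribute when no result record exists.
def serve_result_file(scan)
  if scan[:result] # new code path: SbomScan.result association
    scan[:result][:file]
  else # legacy code path: SbomScan.result_file attribute
    scan[:result_file]
  end
end
```

After Stage 3 the fallback branch is deleted and only the association read remains.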
The rollout would look like:
- Stage 1: Deploy new code supporting dual reads (the `SbomScan.result_file` attribute and the `SbomScan.result` association) but writing only to the `SbomScan.result_file` attribute.
- Stage 2a (enable feature flag for testing): Continue dual reads. SBOM scan result processing for projects with the FF enabled now writes the `SbomScan.result` association while others write `SbomScan.result_file`.
- Stage 2b (enable feature flag globally): Continue dual reads. SBOM scan result processing writes only `SbomScan.result`.
- Stage 3 (after the 2-day `SbomScan` TTL or the next milestone): Update code to read only from `SbomScan.result`.
```mermaid
stateDiagram-v2
    direction LR
    [*] --> Stage1
    Stage1: 18.6 - Original code path (without caching)
    state Stage1 {
        direction LR
        [*] --> upload1
        upload1: Upload SBOM artifact
        upload1 --> process1
        process1: Process SBOM scan
        process1 --> store1
        store1: Store result in SbomScan.result_file
        store1 --> api_return1
        api_return1: API response - Use SbomScan.result_file
    }
    [*] --> Stage2a
    Stage2a: 18.7 (ff on selectively) - Write SbomScan.result_file or SbomScan.result while Reading either
    state Stage2a {
        direction LR
        [*] --> upload2a
        upload2a: Upload SBOM artifact
        upload2a --> process2a
        process2a: Process SBOM scan
        process2a --> store2a_ff_off: Feature flag OFF
        process2a --> store2a_ff_on: Feature flag ON
        store2a_ff_off: Store result in SbomScan.result_file
        store2a_ff_on: Store result in SbomScan.result
        store2a_ff_off --> api_return2a
        store2a_ff_on --> api_return2a
        api_return2a: API response
        api_return2a --> api_return_attr2a: Feature flag OFF
        api_return_attr2a: API response - Use SbomScan.result_file
        api_return2a --> api_return_assoc2a: Feature flag ON
        api_return_assoc2a: API response - Use SbomScan.result
    }
    [*] --> Stage2
    Stage2: Stage 2b (18.7 ff on globally) - Write only SbomScan.result while Reading either SbomScan.result_file or SbomScan.result
    state Stage2 {
        direction LR
        [*] --> upload2
        upload2: Upload SBOM artifact
        upload2 --> scan2
        scan2: Process SBOM scan
        scan2 --> api_return2
        api_return2: API response
        api_return2 --> api_return_attr2: Feature flag OFF
        api_return_attr2: API response - Use SbomScan.result_file
        api_return2 --> api_return_assoc2: Feature flag ON
        api_return_assoc2: API response - Use SbomScanResult.file
    }
    [*] --> Stage3
    Stage3: Stage 3 (18.8?) - New code path only
    state Stage3 {
        direction LR
        [*] --> upload3
        upload3: Upload SBOM artifact
        upload3 --> scan3
        scan3: Process SBOM scan
        scan3 --> api_return3
        api_return3: API response - Use SbomScanResult.file
    }
```
Testing plan
Unit testing
Instance
- Unit tests for all new and modified functionality
Analyzer
- Unit tests for all new and modified functionality
- Unit tests for stability of digest of existing (and possible additional) test SBOMs for each purl type
Integration testing
Analyzer
- Add integration tests for the cache miss/hit result loop with mocked api server
User Acceptance testing
Instance
- Test cache miss/hit scenario
- Verify pipeline security tab
- Verify vulnerability report
- Test with supported purl types
References
- SBOM scan api usage dashboard (internal)
- Projected usage calculations (internal)