As we transition from CI-based security scanning to SBOM-based scans in the Rails backend, we're currently doing the work twice for pipelines with jobs uploading both security reports and SBOM reports. This is the case for our Dependency Scanning and Container Scanning CI jobs.
The ingestion process could be described as follows (simplified):

1. CI artifacts are uploaded
2. pipeline completes
3. security report artifacts are ingested
4. SBOM artifacts are ingested
5. SBOM-based security scans are triggered and the generated vulnerabilities are ingested
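To make the double work concrete, here is a rough Ruby sketch of how the fan-out on pipeline completion described above could look. Apart from `Sbom::ProcessVulnerabilitiesWorker`, which is mentioned later in this thread, the class names are hypothetical placeholders, not the actual GitLab workers:

```ruby
# Hypothetical sketch of the fan-out on pipeline completion. Only
# Sbom::ProcessVulnerabilitiesWorker is a real worker referenced in this
# issue; the other names are placeholders for illustration.
class PipelineCompletionHandler
  def handle(pipeline)
    # Step 3: ingest the CI security report artifacts (creates vulnerabilities).
    SecurityReportIngestionWorker.perform_async(pipeline.id)

    # Step 4: ingest the SBOM (CycloneDX) artifacts into the component tables.
    SbomIngestionWorker.perform_async(pipeline.id)

    # Step 5: run the SBOM-based scan, which re-derives the same vulnerabilities
    # from the same advisory data, hence the duplicated work.
    Sbom::ProcessVulnerabilitiesWorker.perform_async(pipeline.id)
  end
end
```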
There should be no duplicate vulnerabilities in the project as the CI-based reports and the SBOM-based scans rely on the same data sources for both the security advisories and the list of components (there might be edge cases where the CI scan configuration generates different results though). So by the time we reach step 5, the vulnerabilities the SBOM scan would generate have already been created at step 3.
While we have not heard negative feedback from users yet about this approach, there is a concern that this causes useless resource usage and additional traffic on the vulnerability management tables, which are already suffering from too many updates and already went through optimizations: Reduce the number of tuples updated for tables ... (&13616 - closed)
We should estimate this cost and decide whether it is acceptable to bear it until %18.0, when security reports will be removed from our CI-based jobs.
@tkopel FYI I'm tentatively pushing this into %17.7 as we should be careful about the pressure we add on the vulnerability management system and not undo the optimization efforts made there.
I set P2/S2 for now but we might lower the severity depending on the actual impact.
There should be no particular problem between CS and DS as these are still separated in our vulnerability management system, so we had to maintain that separation in our SBOM processing AFAIR. So when we ingest an SBOM report, the GitLab taxonomy properties indicate what type of vulnerability should be created. In other words, even when enabling scanning of language packages in CS, all vulnerabilities created are of CS type.
That said, it seems we've introduced purl_type-based logic which might produce DS vulnerabilities when scanning a non-OS package component coming from an SBOM report generated by Trivy. This seems to be part of the recent CVS on SBOM changes workflow, but it is currently not enabled for CS. We have to verify this before we enable it.
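For context, here is a hypothetical illustration of what such purl_type-based routing could look like; the purl type list and the decision itself are assumptions for illustration, not the actual implementation:

```ruby
# Hypothetical illustration of purl_type-based routing. The OS_PURL_TYPES list
# and the report-type decision are assumptions, not GitLab's actual logic.
OS_PURL_TYPES = %w[apk deb rpm].freeze

def report_type_for(component)
  # OS packages stay under Container Scanning; anything else would be
  # attributed to Dependency Scanning, which is the behavior to verify.
  OS_PURL_TYPES.include?(component.purl_type) ? :container_scanning : :dependency_scanning
end
```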
@gonzoyumo So the problem is that customers running Gemnasium DS, which outputs both a CycloneDX report and a security report, will get duplicated vulnerabilities?
@tkopel fortunately no, this would have been a no-go.
There should be no duplicate vulnerabilities in the project as the CI-based reports and the SBOM-based scans rely on the same data sources for both the security advisories and the list of components (there might be edge cases where the CI scan configuration generates different results though). So by the time we reach step 5, the vulnerabilities the SBOM scan would generate have already been created at step 3.
@gonzoyumo @tkopel if I understand correctly, it seems like we're mainly dealing with a potential performance issue here, from doing the same work twice with both CI and SBOM scans.
We could check out:
DB impact (# of unnecessary queries, execution time)
Whether pipelines are significantly slower because of this
If there's some quick win like skipping the second scan if we already did one (see the sketch after this list)
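To illustrate the third point, the quick win could be a guard of roughly this shape; this is entirely hypothetical, including the predicate name:

```ruby
# Hypothetical guard: skip the SBOM-based vulnerability scan when the pipeline
# already had its CI security report ingested for the same scan types.
def schedule_sbom_vulnerability_scan(pipeline)
  return if pipeline.security_reports_ingested? # assumed predicate, not a real API

  Sbom::ProcessVulnerabilitiesWorker.perform_async(pipeline.id)
end
```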
But, since this is all going away in 18.0 anyway, I wonder how much effort we should put into this.
@tkopel in that case, maybe we should document this behavior and its implications for customers who will continue using deprecated analyzers post 18.0.
If we decide to address this, I think we can proceed in 2 directions:
@onaaman @tkopel I agree that post 18.0 this should only account for a small share of usage. Aside from what you mentioned (usage of the then-deprecated Gemnasium scanner), this will impact users who want to provide both SBOM reports and custom DS reports (3rd party integration). Though, they will also have to keep the Security Scan of SBOM reports enabled, which IMO is not really relevant if they want to use a 3rd party scanner in the first place.
This issue was initially intended to be a verification prior to enabling the feature globally. It is mainly relevant for the time period between now and 18.0 IMO.
One thing to keep in mind is that this is impacting downstream features belonging to other groups. To me, this is similar to what we've discussed in the retrospective meeting today. I think we should at least inform them and get their feedback on the matter. But they are likely to ask what's the added overhead ^^
This group::composition analysis bug has at most 50% of the SLO duration remaining and is an SLO::Near Miss breach. Please consider taking action before this becomes an SLO::Missed in 11 days (2024-12-23).
@gonzoyumo Thanks for bringing this to our attention
While investigating the performance impact of duplicate scanning is important, I think this CVSS issue highlights a more critical concern: since security reports will be deprecated in 18.0, any data discrepancies we find now (like missing CVSS) indicate data we'll permanently lose after the transition to SBOM-only scanning.
This is not just about performance issues until 18.0, but about long-term data integrity.
I think we should address both aspects:
Evaluate the performance impact of duplicate scanning
More critically: ensure SBOM-based scanning captures all data currently provided by security reports
Regarding the first aspect (performance impact):
I did a small experiment on my GDK with a project containing 162 vulnerabilities, and observed the performance metrics in the logs for 3 workers:
The significant difference in query counts suggests that while we are indeed doing duplicate vulnerability processing (as expected), the SBOM-based scan (ProcessVulnerabilitiesWorker) is much more efficient in terms of DB operations - it's doing no writes at all, likely because the vulnerabilities already exist from the first scan. However, we're still using CPU resources (1.78s of CPU time) and DB reads unnecessarily.
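For reference, this kind of tally can be reproduced by summing per-worker fields from the structured Sidekiq log; a rough sketch, assuming the log is JSON and exposes db_count, db_write_count, cpu_s and duration_s (adjust to whatever your instance actually emits):

```ruby
# Rough sketch: sum per-worker metrics from a structured (JSON) Sidekiq log.
# The field names (db_count, db_write_count, cpu_s, duration_s) and the worker
# name patterns are assumptions about the log schema; adjust as needed.
require 'json'

totals = Hash.new { |h, worker| h[worker] = Hash.new(0.0) }

File.foreach('log/sidekiq.log') do |line|
  begin
    entry = JSON.parse(line)
  rescue JSON::ParserError
    next
  end

  worker = entry['class']
  next unless worker&.match?(/SecurityReports|IngestReports|ProcessVulnerabilities/)

  %w[db_count db_write_count cpu_s duration_s].each do |field|
    totals[worker][field] += entry[field].to_f
  end
end

totals.each { |worker, metrics| puts "#{worker}: #{metrics}" }
```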
But, if the CVSS issue is indeed caused by SBOM ingestion overwriting existing data (as updates are counted as write operations), then the 0 writes metric is unexpected. @zmartins maybe you have an idea why this is happening?
Regarding the second aspect (data integrity):
I suggest we open another issue to ensure data parity between CI-based and SBOM-based ingestion (as this is not an impact of the duplication).
I'm not super familiar with the flow but I think we should write tests based on the vulnerability schema and compare old and new ingestion results to verify we're not losing any data.
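As a starting point, the comparison could be as simple as diffing a normalized view of both result sets; a sketch, where the attribute list and the shape of the inputs (one hash per vulnerability) are assumptions:

```ruby
# Sketch of a parity diff between the two ingestion paths. The tracked
# attributes and the input shape (one hash per vulnerability) are assumptions.
TRACKED_ATTRIBUTES = %i[severity identifiers cvss location].freeze

def parity_diff(ci_vulnerabilities, sbom_vulnerabilities)
  normalize = ->(vulns) { vulns.map { |v| v.slice(*TRACKED_ATTRIBUTES) }.sort_by(&:to_s) }
  ci   = normalize.call(ci_vulnerabilities)
  sbom = normalize.call(sbom_vulnerabilities)

  # Anything dropped by the SBOM path shows up under :missing_in_sbom.
  { missing_in_sbom: ci - sbom, extra_in_sbom: sbom - ci }
end
```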
@tkopel @gonzoyumo @zmartins I'll be happy to hear your thoughts on what you think our next steps should be here :)
Your assessment is correct as CVSS isn't currently being set within the CVS workflow. I believe that as long as the respective PackageMetadata::Advisory has either the cvss_v2 or cvss_v3 column set, we should be able to use it.
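For illustration, the wiring could be as small as preferring cvss_v3 over cvss_v2 when building the finding; a sketch, where everything except the cvss_v2/cvss_v3 columns is an assumption about the surrounding code:

```ruby
# Hypothetical sketch: pick the CVSS vector from PackageMetadata::Advisory
# (cvss_v3 preferred over cvss_v2) when building the finding. The output
# shape is an assumption, not the actual report schema.
def cvss_vectors_for(advisory)
  [
    ({ vendor: 'GitLab', vector: advisory.cvss_v3 } unless advisory.cvss_v3.nil?),
    ({ vendor: 'GitLab', vector: advisory.cvss_v2 } unless advisory.cvss_v2.nil?)
  ].compact
end
```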
My suggestion for the next steps is to have an issue covering any attributes that might be missing from the CVS workflow.
@onaaman 100% agree with your statement. Thank you for putting it so clearly.
But, if the CVSS issue is indeed caused by SBOM ingestion overwriting existing data (as updates are counted as write operations), then the 0 writes metric is unexpected
What happens if your project only uploads an SBOM report artifact (no security reports)? Do you have vulnerabilities created then? My guess is that your GDK may not have synced the advisories from PMDB required to create vulnerabilities from the SBOM. This behavior is not enabled by default on the GDK (to avoid pulling the license and advisory data on every GDK instance). BTW we might also want to make this more visible as, starting with 18.0, this will be the only way to get DS (and CS) vulnerabilities created with the default features (we can still craft DS and CS reports to create vulns for testing though).
I populated my GDK with PMDB's advisory data and indeed observed db_writes for the third worker (Sbom::ProcessVulnerabilitiesWorker) once the advisory data was in place. So your guess was correct!
Rerun of the performance experiment with a populated GDK:
Attaching the logs from the 3 workers:
Here's a visualization comparing the base process (blue) vs the additional overhead (red) for each metric when processing a project with 162 vulnerabilities (of course, it's a local test, so the results might differ in prod; this is just to get a sense of scale):
Assuming this experiment is accurate, I believe the most concerning overhead is the db_writes (we are doing almost 3x more writes).
Though the processing time (+58.0%) and CPU (+59%) overheads are also significant.
@gonzoyumo Do you think this experiment is sufficient? Should we plan one in production? Maybe we should start considering mitigations?
Also - maybe the db_writes overhead is something that we should think of separately from this duplication question; we might want to look into optimizing it independently (unless there are more actions the Sbom::ProcessVulnerabilitiesWorker does that might explain this gap).
@tkopel I understand the high-level risk but I don't have enough expertise on this part to make an informed decision. The numbers shown here are pretty alarming. Though such overhead is relative and specific to Dependency Scanning; it doesn't speak to the actual impact on all the security scans that vulnerability management has to ingest. IMO this very much requires attention from group::security insights as this impacts their product area. @nmccorrison is this a risk your team can help assess and deem acceptable for the next 5 months? Once we reach 18.0 the duplicated ingestion will stop as we'll no longer generate the DS security report.
This issue has been open since October and the double ingestion has been running since then without noticeable problems. While it would have been preferable to prevent it, we now only have 2 months left before switching all users of the default DS feature to DS using SBOM only.
The remaining use cases where dual ingestion will be present will be marginal starting with 18.0, which makes this effort much less relevant.
I'm closing this and we can re-open if we re-adjust the migration path to keep dual ingestion until 19.0.