As we transition from CI-based security scanning to SBOM-based scans in the Rails backend, we're currently doing the work twice for pipelines with jobs uploading both security reports and SBOM reports. This is the case for our Dependency Scanning and Container Scanning CI jobs.
The ingestion process could be described as follows (simplified):

1. CI artifacts are uploaded
2. pipeline completes
3. security report artifacts are ingested
4. SBOM artifacts are ingested
5. SBOM-based security scans are triggered and the generated vulnerabilities are ingested
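To make the double work concrete, here is a rough Ruby sketch of how the fan-out on pipeline completion described above could look. Apart from `Sbom::ProcessVulnerabilitiesWorker`, which is mentioned later in this thread, the class names are hypothetical placeholders, not the actual GitLab workers:

```ruby
# Hypothetical sketch of the fan-out on pipeline completion. Only
# Sbom::ProcessVulnerabilitiesWorker is a real worker referenced in this
# issue; the other names are placeholders for illustration.
class PipelineCompletionHandler
  def handle(pipeline)
    # Step 3: ingest the CI security report artifacts (creates vulnerabilities).
    SecurityReportIngestionWorker.perform_async(pipeline.id)

    # Step 4: ingest the SBOM (CycloneDX) artifacts into the component tables.
    SbomIngestionWorker.perform_async(pipeline.id)

    # Step 5: run the SBOM-based scan, which re-derives the same vulnerabilities
    # from the same advisory data, hence the duplicated work.
    Sbom::ProcessVulnerabilitiesWorker.perform_async(pipeline.id)
  end
end
```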
There should be no duplicate vulnerabilities in the project as the CI-based reports and the SBOM-based scans rely on the same data sources for both the security advisories and the list of components (there might be edge cases where the CI scan configuration generates different results though). So by the time we reach step 5, the vulnerabilities the SBOM scan would generate have already been created at step 3.
While we have not heard negative feedback from users yet about this approach, there is a concern that this causes useless resource usage and additional traffic on the vulnerability management tables, which are already suffering from too many updates and already went through optimizations: Reduce the number of tuples updated for tables ... (&13616 - closed)
We should estimate this cost and decide whether it is acceptable to bear it until %18.0, when security reports will be removed from our CI-based jobs.
@tkopel FYI I'm tentatively pushing this into %17.7 as we should be careful about the pressure we add on the vulnerability management system and not undo the optimization efforts made there.
I set P2/S2 for now but we might lower the severity depending on the actual impact.
There should be no particular problem between CS and DS as these are still separated in our vulnerability management system, so we had to maintain that separation in our SBOM processing AFAIR. So when we ingest an SBOM report, the GitLab taxonomy properties indicate what type of vulnerability should be created. In other words, even when enabling scanning of language packages in CS, all vulnerabilities created are of CS type.
That said, it seems we've introduced purl_type-based logic which might produce DS vulnerabilities when scanning a non-OS package component coming from an SBOM report generated by Trivy. This seems to be part of the recent CVS on SBOM changes workflow, but it is currently not enabled for CS. We have to verify this before we enable it.
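For context, here is a hypothetical illustration of what such purl_type-based routing could look like; the purl type list and the decision itself are assumptions for illustration, not the actual implementation:

```ruby
# Hypothetical illustration of purl_type-based routing. The OS_PURL_TYPES list
# and the report-type decision are assumptions, not GitLab's actual logic.
OS_PURL_TYPES = %w[apk deb rpm].freeze

def report_type_for(component)
  # OS packages stay under Container Scanning; anything else would be
  # attributed to Dependency Scanning, which is the behavior to verify.
  OS_PURL_TYPES.include?(component.purl_type) ? :container_scanning : :dependency_scanning
end
```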
@gonzoyumo So the problem is that customers running Gemnasium DS, which outputs both a CycloneDX report and a security report, will get duplicated vulnerabilities?
@tkopel fortunately no, this would have been a no-go.
There should be no duplicate vulnerabilities in the project as the CI-based reports and the SBOM-based scans rely on the same data sources for both the security advisories and the list of components (there might be edge cases where the CI scan configuration generates different results though). So by the time we reach step 5, the vulnerabilities the SBOM scan would generate have already been created at step 3.
@gonzoyumo @tkopel if I understand correctly, it seems like we're mainly dealing with a potential performance issue here, from doing the same work twice with both CI and SBOM scans.
We could check out:
DB impact (# of unnecessary queries, execution time)
Whether pipelines are significantly slower because of this
If there's some quick win like skipping the second scan if we already did one (see the sketch after this list)
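To illustrate the third point, the quick win could be a guard of roughly this shape; this is entirely hypothetical, including the predicate name:

```ruby
# Hypothetical guard: skip the SBOM-based vulnerability scan when the pipeline
# already had its CI security report ingested for the same scan types.
def schedule_sbom_vulnerability_scan(pipeline)
  return if pipeline.security_reports_ingested? # assumed predicate, not a real API

  Sbom::ProcessVulnerabilitiesWorker.perform_async(pipeline.id)
end
```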
But, since this is all going away in 18.0 anyway, I wonder how much effort we should put into this.
@tkopel in that case, maybe we should document this behavior and its implications for customers who will continue using deprecated analyzers post 18.0.
If we decide to address this, I think we can proceed in 2 directions:
@onaaman @tkopel I agree that post 18.0 this should only account for a small share of usage. Aside from what you mentioned (usage of the then-deprecated Gemnasium scanner), this will impact users who want to provide both SBOM reports and custom DS reports (3rd party integration). Though, they will also have to keep the Security Scan of SBOM reports enabled, which IMO is not really relevant if they want to use a 3rd party scanner in the first place.
This issue was initially intended to be a verification prior to enabling the feature globally. It is mainly relevant for the time period between now and 18.0 IMO.
One thing to keep in mind is that this is impacting downstream features belonging to other groups. To me, this is similar to what we've discussed in the retrospective meeting today. I think we should at least inform them and get their feedback on the matter. But they are likely to ask what's the added overhead ^^
This group::composition analysis bug has at most 50% of the SLO duration remaining and is an SLO::Near Miss breach. Please consider taking action before this becomes an SLO::Missed in 11 days (2024-12-23).
@gonzoyumo Thanks for bringing this to our attention
While investigating the performance impact of duplicate scanning is important, I think this CVSS issue highlights a more critical concern: since security reports will be deprecated in 18.0, any data discrepancies we find now (like missing CVSS) indicate data we'll permanently lose after the transition to SBOM-only scanning.
This is not just about performance issues until 18.0, but about long-term data integrity.
I think we should address both aspects:
Evaluate the performance impact of duplicate scanning
More critically: ensure SBOM-based scanning captures all data currently provided by security reports
Regarding the first aspect (performance impact):
I did a small experiment on my GDK with a project containing 162 vulnerabilities, and observed the performance metrics in the logs for 3 workers:
The significant difference in query counts suggests that while we are indeed doing duplicate vulnerability processing (as expected), the SBOM-based scan (ProcessVulnerabilitiesWorker) is much more efficient in terms of DB operations - it's doing no writes at all, likely because the vulnerabilities already exist from the first scan. However, we're still using CPU resources (1.78s of CPU time) and DB reads unnecessarily.
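For reference, this kind of tally can be reproduced by summing per-worker fields from the structured Sidekiq log; a rough sketch, assuming the log is JSON and exposes db_count, db_write_count, cpu_s and duration_s (adjust to whatever your instance actually emits):

```ruby
# Rough sketch: sum per-worker metrics from a structured (JSON) Sidekiq log.
# The field names (db_count, db_write_count, cpu_s, duration_s) and the worker
# name patterns are assumptions about the log schema; adjust as needed.
require 'json'

totals = Hash.new { |h, worker| h[worker] = Hash.new(0.0) }

File.foreach('log/sidekiq.log') do |line|
  begin
    entry = JSON.parse(line)
  rescue JSON::ParserError
    next
  end

  worker = entry['class']
  next unless worker&.match?(/SecurityReports|IngestReports|ProcessVulnerabilities/)

  %w[db_count db_write_count cpu_s duration_s].each do |field|
    totals[worker][field] += entry[field].to_f
  end
end

totals.each { |worker, metrics| puts "#{worker}: #{metrics}" }
```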
But, if the CVSS issue is indeed caused by SBOM ingestion overwriting existing data (as updates are counted as write operations), then the 0 writes metric is unexpected. @zmartins maybe you have an idea why this is happening?
Regarding the second aspect (data integrity):
I suggest we open another issue to ensure data parity between CI-based and SBOM-based ingestion (as this is not an impact of the duplication).
I'm not super familiar with the flow but I think we should write tests based on the vulnerability schema and compare old and new ingestion results to verify we're not losing any data.
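As a starting point, the comparison could be as simple as diffing a normalized view of both result sets; a sketch, where the attribute list and the shape of the inputs (one hash per vulnerability) are assumptions:

```ruby
# Sketch of a parity diff between the two ingestion paths. The tracked
# attributes and the input shape (one hash per vulnerability) are assumptions.
TRACKED_ATTRIBUTES = %i[severity identifiers cvss location].freeze

def parity_diff(ci_vulnerabilities, sbom_vulnerabilities)
  normalize = ->(vulns) { vulns.map { |v| v.slice(*TRACKED_ATTRIBUTES) }.sort_by(&:to_s) }
  ci   = normalize.call(ci_vulnerabilities)
  sbom = normalize.call(sbom_vulnerabilities)

  # Anything dropped by the SBOM path shows up under :missing_in_sbom.
  { missing_in_sbom: ci - sbom, extra_in_sbom: sbom - ci }
end
```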
@tkopel @gonzoyumo @zmartins I'll be happy to hear your thoughts on what you think our next steps should be here :)
Your assessment is correct as CVSS isn't currently being set within the CVS workflow. I believe that as long as the respective PackageMetadata::Advisory has either the cvss_v2 or cvss_v3 column set, we should be able to use it.
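For illustration, the wiring could be as small as preferring cvss_v3 over cvss_v2 when building the finding; a sketch, where everything except the cvss_v2/cvss_v3 columns is an assumption about the surrounding code:

```ruby
# Hypothetical sketch: pick the CVSS vector from PackageMetadata::Advisory
# (cvss_v3 preferred over cvss_v2) when building the finding. The output
# shape is an assumption, not the actual report schema.
def cvss_vectors_for(advisory)
  [
    ({ vendor: 'GitLab', vector: advisory.cvss_v3 } unless advisory.cvss_v3.nil?),
    ({ vendor: 'GitLab', vector: advisory.cvss_v2 } unless advisory.cvss_v2.nil?)
  ].compact
end
```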
My suggestion for the next steps is to have an issue covering any attributes that might be missing from the CVS workflow.
@onaaman 100% agree with your statement. Thank you for putting it so clearly.
But, if the CVSS issue is indeed caused by SBOM ingestion overwriting existing data (as updates are counted as write operations), then the 0 writes metric is unexpected
What happens if your project only uploads an SBOM report artifact (no security reports)? Do you have vulnerabilities created then? My guess is that your GDK may not have synced the advisories from PMDB required to create vulnerabilities from the SBOM. This behavior is not enabled by default on the GDK (to avoid pulling the license and advisory data on every GDK instance). BTW we might also want to make this more visible as, starting with 18.0, this will be the only way to get DS (and CS) vulnerabilities created with the default features (we can still craft DS and CS reports to create vulns for testing though).
I populated my GDK with PMDB's advisory data and indeed observed db_writes for the third worker (Sbom::ProcessVulnerabilitiesWorker) once the advisory data was in place. So your guess was correct!
Rerun of the performance experiment with a populated GDK:
Attaching the logs from the 3 workers:
Here's a visualization comparing the base process (blue) vs the additional overhead (red) for each metric when processing a project with 162 vulnerabilities (of course, it's a local test, so the results might differ in prod; this is just to get a sense of scale):
Assuming this experiment is accurate, I believe the most concerning overhead is the db_writes (we are doing almost 3x more writes).
Though the processing time (+58.0%) and CPU (+59%) overheads are also significant.
@gonzoyumo Do you think this experiment is sufficient? Should we plan one in production? Maybe we should start considering mitigations?
Also - maybe the db_writes overhead is something that we should think of separately from this duplication question; we might want to look into optimizing it independently (unless there are more actions the Sbom::ProcessVulnerabilitiesWorker does that might explain this gap).
@tkopel I understand the high-level risk but I don't have enough expertise on this part to make an informed decision. The numbers shown here are pretty alarming. Though such overhead is relative and specific to Dependency Scanning; it doesn't speak to the actual impact on all the security scans that vulnerability management has to ingest. IMO this very much requires attention from group::security insights as this impacts their product area. @nmccorrison is this a risk your team can help assess and deem acceptable for the next 5 months? Once we reach 18.0 the duplicated ingestion will stop as we'll no longer generate the DS security report.
This issue has been open since October and the double ingestion has been running since then without noticeable problems. While it would have been preferable to prevent it, we now only have 2 months left before switching all users of the default DS feature to DS using SBOM only.
The remaining use cases where dual ingestion will be present will be marginal starting with 18.0, which makes this effort much less relevant.
I'm closing this and we can re-open if we re-adjust the migration path to keep dual ingestion until 19.0.