Packages with too many versions lead to inconsistent exported data in License DB and unknown results for License Scanning
Summary
This problem was initially discovered with Golang packages that License Scanning reported as having an unknown license. This happens when the package path in the SBOM artifact does not exactly match the real package path with respect to letter case. For example:
- github.com/DataDog/datadog-agent/pkg/proto for 0.50.0-rc.1 will show up as having an unknown license.
- At the same time, github.com/datadog/datadog-agent/pkg/proto for 0.50.0-rc.1 has Apache-2.0.
Querying the public license bucket
Searching (ignoring case) for github.com/datadog/datadog-agent/pkg/proto shows that the bucket has the required version only for github.com/datadog/datadog-agent/pkg/proto and not for github.com/DataDog/datadog-agent/pkg/proto:
grep -r -i "github.com/datadog/datadog-agent/pkg/proto" | grep "\"0.50.0-rc.1\""
./1699974179/000000001.ndjson:{"name":"github.com/datadog/datadog-agent/pkg/proto","lowest_version":"0.46.0-20230513-devel","highest_version":"0.50.0-rc.1","default_licenses":["Apache-2.0"]}
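The same check can be sketched in Python to make the case variants explicit. This is a minimal sketch, not the exporter's or the rails platform's actual code: the function name group_case_variants is hypothetical, and the only record fields it relies on are the ones visible in the ndjson line above. Unlike grep -i, which matches substrings, it compares the whole name ignoring case.

```python
import json
from collections import defaultdict


def group_case_variants(ndjson_lines, module_path):
    """Group exported records by exact-case name, matching module_path ignoring case.

    Hypothetical helper for illustration only.
    """
    wanted = module_path.lower()
    variants = defaultdict(list)
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export file
        record = json.loads(line)
        if record.get("name", "").lower() == wanted:
            # Key by the name exactly as exported, so each case variant stays visible.
            variants[record["name"]].append(record)
    return dict(variants)


# Sample input: the single record returned by the grep above.
sample = [
    '{"name":"github.com/datadog/datadog-agent/pkg/proto",'
    '"lowest_version":"0.46.0-20230513-devel",'
    '"highest_version":"0.50.0-rc.1",'
    '"default_licenses":["Apache-2.0"]}',
]

found = group_case_variants(sample, "github.com/DataDog/datadog-agent/pkg/proto")
for name, records in found.items():
    print(name, [r["highest_version"] for r in records])
```

With this sample, only the lowercase variant appears as a key, which mirrors the finding above: the mixed-case path has no exported record of its own.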
At this point, if I create a Go project I can only import github.com/DataDog/datadog-agent/pkg/proto. So, as a user, I cannot use github.com/datadog/datadog-agent/pkg/proto. Also, the package path in the go.mod file is declared as github.com/DataDog/datadog-agent/pkg/proto.
However, while diving into the problem we discovered a limitation in the export logic that leads to some packages being skipped, so they are never synced with the rails platform.
Steps to reproduce
- Create a Go project
- Import github.com/DataDog/datadog-agent/pkg/proto at v0.50.0-rc.1
- Enable DS
- Check the pipeline's licenses tab
Zendesk ticket - internal only
Example Project
See example packages in this request for help (internal): https://gitlab.com/gitlab-com/sec-sub-department/section-sec-request-for-help/-/issues/175#note_1774672495
What is the current bug behavior?
Some Golang packages show up as having an unknown license.
What is the expected correct behavior?
These packages should show a valid license
Relevant logs and/or screenshots
Proposal
In the PMDB we perform Golang package name normalization. The main problem we are facing is that the exporter attempts to export packages with too many versions and fails due to schema validation. We should fix this issue by truncating the other_licenses versions to the current max defined by the license schema.
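The truncation could look roughly like the sketch below. It rests on several assumptions that are not confirmed by this issue: that other_licenses is a list of objects each carrying a "licenses" list and a "versions" list, that the limit applies per entry, and that keeping the first versions in the list is acceptable. The function name and the max_versions parameter are hypothetical; the real limit is whatever the license schema defines.

```python
def truncate_other_licenses(record, max_versions):
    """Return a copy of record whose other_licenses entries list at most max_versions versions.

    Assumed shape (not confirmed by the issue):
      {"other_licenses": [{"licenses": [...], "versions": [...]}, ...]}
    Which versions to drop is a policy decision; this sketch keeps the first ones.
    """
    truncated = dict(record)  # shallow copy so the input record is not modified
    truncated["other_licenses"] = [
        {**entry, "versions": entry["versions"][:max_versions]}
        for entry in record.get("other_licenses", [])
    ]
    return truncated


# Hypothetical record for illustration only.
record = {
    "name": "example.com/some/module",
    "default_licenses": ["MIT"],
    "other_licenses": [
        {"licenses": ["Apache-2.0"], "versions": ["1.0.0", "1.0.1", "1.0.2", "1.0.3"]},
    ],
}

exportable = truncate_other_licenses(record, max_versions=2)
print(exportable["other_licenses"][0]["versions"])
```

Truncating means some versions will fall back to the default licenses or to an unknown result, so the choice of which versions to keep deserves its own discussion.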
Previously we discussed various fixes without knowing that we were missing packages in PMDB due to validation failures. The past proposals are kept below for historical reference.
Previous possible fixes
Proposal 1
- The license-feeder stays case sensitive so that we keep a good traceability of the data we are gathering. The downside is that we will possibly store multiple variants of the same module, with different cases in the Package Metadata DB. The DB usage should not be a concern though as this is an independent DB from the GitLab instance.
- The license-exporter should be modified to export all the data we have for all case variants. This has the downside of exporting and syncing more data with the rails application (DB usage might be more of a concern here, probably worth checking the impact). The upside is that it offers more opportunities to adjust the logic downstream to enhance the UX around case handling, as all the data is available locally.
- The rails platform matching logic should be modified to do a case-insensitive lookup. As all variants are exported by the license-exporter, the rails logic should be responsible for the merge and dedupe of results (rules TBD).
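The downstream lookup described in this proposal can be illustrated with a small sketch. It is written in Python for illustration and is not the rails implementation. Since the merge and dedupe rules are TBD, the rule used here (union of licenses across all case variants, in insertion order) is only an assumption, and the VariantIndex class is hypothetical. Versions are ignored to keep the example short.

```python
from collections import defaultdict


class VariantIndex:
    """Hypothetical in-memory index that keeps every case variant of a package name."""

    def __init__(self):
        # Case-folded name -> list of (exact-case name, licenses) pairs.
        self._by_folded_name = defaultdict(list)

    def add(self, name, licenses):
        self._by_folded_name[name.casefold()].append((name, list(licenses)))

    def variants(self, name):
        """Return the exact-case names stored for this package, ignoring case."""
        return [exact for exact, _ in self._by_folded_name.get(name.casefold(), [])]

    def lookup(self, name):
        """Case-insensitive lookup; merges licenses of all variants without duplicates.

        The union rule is an assumption, the issue leaves the merge rules TBD.
        """
        merged = []
        for _, licenses in self._by_folded_name.get(name.casefold(), []):
            for lic in licenses:
                if lic not in merged:
                    merged.append(lic)
        return merged


index = VariantIndex()
index.add("github.com/datadog/datadog-agent/pkg/proto", ["Apache-2.0"])
index.add("github.com/DataDog/datadog-agent/pkg/proto", ["Apache-2.0"])

print(index.variants("github.com/DataDog/datadog-agent/pkg/proto"))
print(index.lookup("github.com/DataDog/datadog-agent/pkg/proto"))
```

Keeping the exact-case names alongside the folded key preserves traceability while still returning a single deduplicated answer to the caller.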
Proposal 2
- (Same as proposal 1) The license-feeder stays case sensitive so that we keep a good traceability of the data we are gathering. The downside is that we will possibly store multiple variants of the same module, with different cases in the Package Metadata DB. The DB usage should not be a concern though as this is an independent DB from the GitLab instance.
- The license-exporter should be modified to only export a single variant and should be responsible for the merge and dedupe of all the variants we have in the Package Metadata DB (rules TBD)
- The rails platform matching logic should be modified to do a case-insensitive lookup. As a single variant is exported, the DB lookup will return a unique result, which is the currently expected behavior of that codebase (so this is a simpler change in the rails source code). The rails DB should probably be migrated to clean up other variants that will no longer be synced and used, though.
Proposal 3
- The license-feeder (or license-processor?) becomes case insensitive and stores only a single variant in the Package Metadata DB. The feeder (or processor) is then responsible for the merge and dedupe of the original data we gather. There should be no noticeable performance hit, and this workflow is decoupled from the user-facing features and workflows. The DB must be cleaned up and existing records merged together. The downside is that we lose a bit of traceability of the data we are gathering. This could possibly be alleviated with additional observability if necessary. This looks like a faster change for short-term results, but with a higher cost if we change our mind down the road.
- The license-exporter probably needs no change.
- (Same as proposal 2) The rails platform matching logic should be modified to do a case-insensitive lookup. As a single variant is exported, the DB lookup will return a unique result, which is the currently expected behavior of that codebase (so this is a simpler change in the rails source code). The rails DB should probably be migrated to clean up other variants that will no longer be synced and used, though.