deps.dev vs external license db findings
Purpose
This issue summarizes findings from a comparison of the external license database with the deps.dev dataset.
An analysis was conducted on March 8th, 2023, comparing the Go, Maven, npm, and PyPI datasets against the current external license database to identify gaps in coverage.
Process
Data from the deps.dev data set was downloaded to a remote bucket:
EXPORT DATA OPTIONS(
uri='gs://deps-dev-data/deps-distinct/*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT System, Name, Version, Licenses
FROM `bigquery-public-data.deps_dev_v1.PackageVersions`, UNNEST(Licenses) AS Licenses
GROUP BY System, Name, Version, Licenses;
These CSV files were then merged and de-duplicated with sort -u into one CSV file per language/package manager. deps.dev marks unknown licenses as non-standard; these were converted to the license-db format, which uses unknown:
sed -i '' -e 's/non-standard/unknown/g' sorted_pypi.csv
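The merge and normalization steps above might be scripted roughly as follows. This is a sketch, not the commands actually run for this analysis: merge_shards is a hypothetical helper. It assumes the exported shards are semicolon-delimited with unquoted fields, that each shard begins with a header row, and that the System column holds values such as PYPI. Unlike the sed command, it rewrites non-standard only when it is the entire license field, so a package whose name happens to contain that string is left untouched.

```shell
# merge_shards SYSTEM OUT_FILE SHARD...
# Hypothetical helper (not from the original analysis).
# Keeps rows whose first field equals SYSTEM, which also drops the header
# row of every shard, rewrites a license of exactly "non-standard" to
# "unknown" in the fourth field only, then sorts and de-duplicates.
merge_shards() {
  local system="$1" out="$2"
  shift 2
  awk -F';' -v OFS=';' -v sys="$system" '
    $1 == sys {
      if ($4 == "non-standard") $4 = "unknown"
      print
    }
  ' "$@" | LC_ALL=C sort -u > "$out"
}
```

For example, merge_shards PYPI sorted_pypi.csv ./deps-distinct/*.csv would build the PyPI file from a local copy of the export; the local shard path is an assumption.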
Data from the external license database was downloaded, merged, and sorted into its own CSV files as well.
A unified diff was created for review:
diff -u ./licensedb/sorted_pypi.csv ./deps-dev/result/sorted_pypi.csv > pypi.diff
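To get a rough sense of the size of a diff before reviewing it by hand, the added and removed rows can be counted. The sketch below uses a hypothetical summarize_diff helper and assumes the file was produced by a single diff -u invocation like the one above, so that its first two lines are the file headers.

```shell
# summarize_diff DIFF_FILE
# Hypothetical helper (not from the original analysis).
# Counts rows present only in the first input ("-" lines) and only in the
# second input ("+" lines) of a unified diff. The first two lines are
# skipped because they are the "---" and "+++" file headers.
summarize_diff() {
  awk '
    NR <= 2 { next }
    /^-/    { only_left++ }
    /^\+/   { only_right++ }
    END {
      printf "only_in_licensedb=%d only_in_depsdev=%d\n", only_left, only_right
    }
  ' "$1"
}
```

Calling summarize_diff pypi.diff prints both counts on one line. A row whose license differs between the two sources is counted once on each side, so the numbers are an upper bound on missing packages rather than an exact gap count.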
Edited by Isaac Dawson