deps.dev vs external license db findings

Purpose

This issue summarizes findings from a comparison between the external license database and the deps.dev dataset.

An analysis was conducted on March 8th, 2023, comparing the Go, Maven, npm and PyPI data sets from deps.dev against the current external license database to identify gaps in coverage.

Process

Data from the deps.dev data set was exported from BigQuery to a Google Cloud Storage bucket:

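-- Export one row per distinct (System, Name, Version, license) combination.
EXPORT DATA OPTIONS(
  uri='gs://deps-dev-data/deps-distinct/*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter=';') AS
SELECT System, Name, Version, License AS Licenses
FROM `bigquery-public-data.deps_dev_v1.PackageVersions`,
  UNNEST(Licenses) AS License  -- alias renamed so it no longer shadows the Licenses column
GROUP BY System, Name, Version, License;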

The exported CSV files were then merged and deduplicated with sort -u into one CSV file per language/package manager. deps.dev marks unknown licenses as non-standard; these were converted to the license database's value of unknown:

  • sed -i '' -e 's/non-standard/unknown/g' sorted_pypi.csv
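
The merge, deduplicate and normalize steps can be sketched end to end as below. This is a minimal illustration rather than the exact commands that were run: the shard file names, the sample rows and the package names are made up, and it assumes the System column holds the value PYPI for PyPI packages. Unlike the sed command above, the substitution here is anchored to the last field so that a package name containing "non-standard" would not be rewritten.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical working directory and shard names standing in for the exported CSV files.
work="$(mktemp -d)"
printf '%s\n' 'System;Name;Version;Licenses' \
  'PYPI;example-a;1.0.0;MIT' \
  'PYPI;example-b;2.0.0;non-standard' > "$work/shard-0.csv"
printf '%s\n' 'System;Name;Version;Licenses' \
  'PYPI;example-a;1.0.0;MIT' \
  'NPM;example-c;0.1.0;ISC' > "$work/shard-1.csv"

# Merge all shards, keep only PyPI rows (this also drops the header lines),
# map non-standard to unknown in the license field, then sort and deduplicate.
cat "$work"/shard-*.csv \
  | awk -F';' '$1 == "PYPI"' \
  | sed -e 's/;non-standard$/;unknown/' \
  | LC_ALL=C sort -u > "$work/sorted_pypi.csv"

cat "$work/sorted_pypi.csv"
```

Running the substitution before sort -u keeps the final file in sorted order, which matters for the later diff. LC_ALL=C is set so that both sides of the comparison are sorted byte-wise regardless of the local locale.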

Data from the external license database was downloaded, then merged and sorted into its own CSV files in the same way.

A unified diff was created for review:

  • diff -u ./licensedb/sorted_pypi.csv ./deps-dev/result/sorted_pypi.csv > pypi.diff
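
To get a quick sense of the size of the gap before reading the diff, the removed and added lines can be counted. The sketch below is an assumption about how such a summary could be produced, not part of the original analysis; the sample rows and package names are made up, and the directory layout simply mirrors the paths in the diff command above. Lines starting with a single - exist only in the license database file, and lines starting with a single + exist only in the deps.dev file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample files standing in for the two sorted PyPI CSV files.
work="$(mktemp -d)"
mkdir -p "$work/licensedb" "$work/deps-dev/result"
printf '%s\n' 'PYPI;example-a;1.0.0;MIT' \
  'PYPI;example-b;2.0.0;unknown' > "$work/licensedb/sorted_pypi.csv"
printf '%s\n' 'PYPI;example-a;1.0.0;MIT' \
  'PYPI;example-b;2.0.0;Apache-2.0' \
  'PYPI;example-c;0.1.0;BSD-3-Clause' > "$work/deps-dev/result/sorted_pypi.csv"

# diff exits with status 1 when the files differ, which is expected here;
# any other non-zero status is a real error and aborts the script.
diff -u "$work/licensedb/sorted_pypi.csv" "$work/deps-dev/result/sorted_pypi.csv" \
  > "$work/pypi.diff" || [ $? -eq 1 ]

# Count changed rows, skipping the ---/+++ file header lines of the unified diff.
only_licensedb="$(grep -c '^-[^-]' "$work/pypi.diff" || true)"
only_depsdev="$(grep -c '^+[^+]' "$work/pypi.diff" || true)"

echo "rows only in license db: $only_licensedb"
echo "rows only in deps.dev:   $only_depsdev"
```

A package version whose license differs between the two sources shows up once on each side, so these counts are an upper bound on missing coverage rather than an exact figure.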

//cc @dbolkensteyn @fcatteau @brytannia

Edited by Isaac Dawson