Spike: Assess Go package differences between deps.dev and license-db

Topic to evaluate

The original research spike noted that license-db contains four times as many package versions as deps.dev. This disparity needs to be explained before deps.dev can be considered as a substitute for the current data source.

Proposal

Investigate the difference between the two data sources and explain the disparity.

One possible method is to spot-check well-known packages. Wherever a significant disparity is found, list the package's actual versions at its source and check which of them appear in each data source. Use this to determine whether there is a systematic error or a categorization difference between the two data sources.

Tasks to Evaluate

Following the proposal above:

  • Generate data
    • Using the latest snapshot, group by package and count the versions of each package in the PackageVersions table for System=Go.
    • Do the same using the license-db.go_license table.
    • Export both as JSON.
  • Compare by joining both of the above sources on package name.
  • Analyze
    • Find packages with significant disparities.
    • For each package, compare versions (commits, tags) at the source against the data source.
  • Assess whether deps.dev is missing versions or whether license-db is over-classifying packages.
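
The compare step above can be sketched as follows. This is a minimal illustration, not the script used in the spike: it assumes each JSON export is a list of objects with hypothetical `name` and `versions` fields, and the `min_ratio` threshold for a "significant" disparity is an arbitrary choice.

```python
import json


def load_counts(text):
    """Parse a JSON export into {package_name: version_count}.

    Assumes rows shaped like {"name": ..., "versions": ...}; these field
    names are hypothetical and depend on how the exports were produced.
    """
    return {row["name"]: int(row["versions"]) for row in json.loads(text)}


def find_disparities(depsdev, licensedb, min_ratio=2.0):
    """Full outer join on package name; keep packages whose counts differ
    by at least min_ratio. A package missing from one source counts as 0.

    Returns (name, depsdev_count, licensedb_count) tuples, largest
    absolute difference first.
    """
    flagged = []
    for name in set(depsdev) | set(licensedb):
        d = depsdev.get(name, 0)
        l = licensedb.get(name, 0)
        lo, hi = min(d, l), max(d, l)
        if hi > lo and hi >= min_ratio * lo:
            flagged.append((name, d, l))
    return sorted(flagged, key=lambda r: (-abs(r[1] - r[2]), r[0]))


if __name__ == "__main__":
    # Placeholder data, not real counts from either source.
    depsdev_json = '[{"name": "example.com/a", "versions": 10}, {"name": "example.com/b", "versions": 5}]'
    licensedb_json = '[{"name": "example.com/a", "versions": 40}, {"name": "example.com/b", "versions": 5}, {"name": "example.com/c", "versions": 3}]'
    for row in find_disparities(load_counts(depsdev_json), load_counts(licensedb_json)):
        print(row)
```

Treating the join as a full outer join keeps packages that exist in only one source, which is itself a disparity worth inspecting.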

Timebox

2d

Conclusion

As of the closing date of this issue, deps.dev is not usable as a source of Go license data because pseudo-versions are consistently missing for many Go modules.
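
To make the conclusion concrete, the sketch below shows one way to check which pseudo-versions of a module are present in one version list but absent from another. It is an illustration under assumptions: the regular expression is modeled on the pseudo-version formats described in the Go modules reference, and the function names and input lists are hypothetical rather than taken from the spike.

```python
import re

# Modeled on the three documented pseudo-version shapes:
#   vX.0.0-yyyymmddhhmmss-abcdefabcdef
#   vX.Y.Z-pre.0.yyyymmddhhmmss-abcdefabcdef
#   vX.Y.(Z+1)-0.yyyymmddhhmmss-abcdefabcdef
# with an optional build suffix such as +incompatible.
PSEUDO_VERSION_RE = re.compile(
    r"^v[0-9]+\.(0\.0-|[0-9]+\.[0-9]+-([^+]*\.)?0\.)[0-9]{14}-[A-Za-z0-9]+"
    r"(\+[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?$"
)


def is_pseudo_version(version):
    """Return True if the version string looks like a Go pseudo-version."""
    return bool(PSEUDO_VERSION_RE.match(version))


def missing_pseudo_versions(reference_versions, candidate_versions):
    """Pseudo-versions present in reference_versions but not in candidate_versions."""
    candidate = set(candidate_versions)
    return sorted(
        v for v in reference_versions
        if is_pseudo_version(v) and v not in candidate
    )


if __name__ == "__main__":
    # Placeholder version lists for a single hypothetical module.
    reference = ["v1.2.3", "v0.0.0-20191109021931-daa7c04131f5"]
    candidate = ["v1.2.3"]
    print(missing_pseudo_versions(reference, candidate))
```

Passing the license-db versions as the reference and the deps.dev versions as the candidate would list the pseudo-versions that deps.dev lacks for that module.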

Edited by Igor Frenkel