Find software license in repository when NPM returns an empty license

Release notes

TODO

Problem to solve

The NPM registry reports the software license for a package based on the license field declared in package.json. If the field is empty, the package is assumed to have no license, which may be incorrect. For example, some packages check in a license file at the root of the repository, e.g. LICENSE or LICENSE.md, and ignore the package.json manifest entirely when declaring their license terms.

Proposal

Use a hybrid approach that combines two detection methods (registry-based and file-based) to increase the accuracy and recall of our license scanning for NPM projects. The following edge cases should be handled:

| License file present? | License in package.json present? | Decision |
| --- | --- | --- |
| Yes | No | Use license in license file. |
| No | Yes | Use license in package.json. |
| Yes | Yes | Use license in package.json. |
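
The decision rules above can be sketched as a small Go function. This is only an illustration: the function name resolveLicense and its parameters are hypothetical, and the case where neither source yields a license is not covered by the rules above, so returning "not found" for it is an assumption.

```go
package main

import "fmt"

// resolveLicense is a hypothetical helper that applies the decision rules:
// a license declared in package.json takes precedence, and a license
// detected from a license file is used only when package.json declares none.
// Assumption: when neither source yields a license, ok is false.
func resolveLicense(manifestLicense, fileLicense string) (license string, ok bool) {
	if manifestLicense != "" {
		return manifestLicense, true
	}
	if fileLicense != "" {
		return fileLicense, true
	}
	return "", false
}

func main() {
	// License file present, package.json license missing.
	license, ok := resolveLicense("", "MIT")
	fmt.Println(license, ok)

	// Both present: package.json wins.
	license, ok = resolveLicense("Apache-2.0", "MIT")
	fmt.Println(license, ok)
}
```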

The golang interfacer implementation keeps a list of file names that are known to contain licenses and looks for them in a directory. It then inspects the files that exist and classifies the license(s) they contain.

A similar approach could be taken for NPM packages:

  • Check if the registry contains an entry for the license in package.json. This can be determined by checking if the response to the request for the package's metadata returns a license. If it does, then use that as the known license.
  • If the registry does not contain an entry for the license, then check for a license file and use the classifier to determine what the licenses are for the package. Use these licenses as the known licenses for the package.

Risks

The license files are all read into memory and passed to the classifier, and they stay in memory until the garbage collector runs. If this results in high memory usage, thrashing can occur and degrade the performance of the npm interfacer. Profiling runs can help determine whether the classifier introduces heavy memory consumption. Other options include tuning garbage collection and reusing memory buffers to avoid extra allocations.

Feature Usage Metrics

TODO

Implementation Plan

TODO


Edited by Oscar Tovar