Identify issues when running offline with License Finder tooling
- Crack open the relevant package manager file from here.
- Read the docs for the package manager to understand how to tell it to pull packages from a custom source (i.e. an authenticated or unauthenticated package registry).
- Use GitLab Package Registry to help with development/testing.
- Ship new docker image to airgap environment. (currently: a VERY manual step)
Problem to solve
We want to be able to detect software licenses for dependencies associated with a software project. An ideal solution would allow detection in an air gap environment.
The current tool that we are using to detect software licenses associated with a software project is LicenseFinder.
LicenseFinder operates by using the package manager associated with different lock files.
When license_finder detects a Gemfile.lock, it uses bundler to resolve dependencies and list the software licenses associated with them. It does this in two steps:
- It runs BUNDLE_PATH=. bundle install to install gems.
- It reads the *.gemspec files to find the licenses for a gem.
The bundle install step requires that packages be installed locally so that the associated *.gemspec files can be parsed.
If we want to avoid any network traffic at all, then we would need to remove the
bundle install step and identify
the software dependencies using the
Gemfile.lock. If packages aren't installed on the system then we would need
to source that information from data that resides within the scan environment.
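As a rough illustration of identifying dependencies from the Gemfile.lock alone, the sketch below reads name/version pairs straight from the lock file text without running bundler. The function name `parse_gemfile_lock` and the sample lock file are made up for this example, and the parser relies on the assumption that resolved specs sit at exactly four spaces of indentation. It does not record which source (GEM, GIT or PATH) a spec came from.

```python
import re

# Resolved specs are indented by exactly four spaces: "    name (version)".
# Their own requirements are indented by six spaces and are skipped.
SPEC_LINE = re.compile(r"^ {4}(?P<name>\S+) \((?P<version>[^)]+)\)$")


def parse_gemfile_lock(text):
    """Return sorted (name, version) pairs for every resolved spec."""
    specs = set()
    for line in text.splitlines():
        match = SPEC_LINE.match(line)
        if match:
            specs.add((match["name"], match["version"]))
    return sorted(specs)


# Illustrative lock file content, not taken from a real project.
SAMPLE = """\
GEM
  remote: https://rubygems.org/
  specs:
    nokogiri (1.10.9)
      mini_portile2 (~> 2.4.0)
    mini_portile2 (2.4.0)

PLATFORMS
  ruby

DEPENDENCIES
  nokogiri

BUNDLED WITH
   2.1.4
"""

for name, version in parse_gemfile_lock(SAMPLE):
    print(name, version)
```

This only yields names and versions. The licenses still have to come from data that is already present in the scan environment.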
If local area network traffic is allowed, then the author of the
Gemfile/Gemfile.lock would need to specify a different source.
    source 'https://gemdist-int.example.org'
    gem 'rails'
In this particular example we're able to control the destination server to be used to retrieve ruby gems. Each package manager is slightly different.
Let's look at Java. For Java, LicenseFinder supports gradle and mvn via plugins.
To get a list of software licenses for a gradle project, the build.gradle must install the license-gradle-plugin to resolve software licenses as a set of XML files.
LicenseFinder then parses these XML files to get the software licenses
associated with each package. More details can be found here.
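To make the XML parsing step concrete, here is a minimal sketch that reads a license report and maps each dependency to its license names. The report shape used here (a `dependencies` root with `dependency` elements carrying a `name` attribute and nested `license` elements) is an assumption for illustration. Check the plugin's actual output before relying on it. `licenses_from_report` is a hypothetical helper, not part of LicenseFinder.

```python
import xml.etree.ElementTree as ET


def licenses_from_report(xml_text):
    """Map each dependency name to the list of license names in the report."""
    root = ET.fromstring(xml_text)
    result = {}
    for dependency in root.iter("dependency"):
        names = [item.get("name") for item in dependency.findall("license")]
        result[dependency.get("name")] = names
    return result


# Assumed report layout with made-up coordinates.
SAMPLE_REPORT = """\
<dependencies>
  <dependency name="org.example:demo:1.0.0">
    <file>demo-1.0.0.jar</file>
    <license name="Apache-2.0" url="https://www.apache.org/licenses/LICENSE-2.0"/>
  </dependency>
  <dependency name="org.example:unlicensed:0.1.0">
    <file>unlicensed-0.1.0.jar</file>
  </dependency>
</dependencies>
"""

for coordinate, licenses in licenses_from_report(SAMPLE_REPORT).items():
    print(coordinate, licenses)
```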
mvn requires a different plugin.
A more comprehensive writeup can be found here.
If the software dependencies can be installed before the
license_scanning job runs, then
it's possible to detect licenses for some project types.
    license_scanning:
      variables:
        LICENSE_FINDER_CLI_OPTS: '--no-prepare'
      before_script:
        - bundle install --local
If we can ask customers to keep their licensed cache up to date, then we can just parse the cache data. This shifts the burden from us to the project owner.
This is a neat idea. It packs a list of known licenses into a specialized database file, then provides a tool for matching a LICENSE text file against the licenses found in that database. This requires that all
packages are installed in the scan environment and that we can detect the
LICENSE file for each of the dependencies.
This doesn't remove the package installation step and still has similar issues to the existing solution. However, the offline database of license files can help.
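The idea of comparing a LICENSE file against an offline set of known license texts can be sketched with the Python standard library. This is not how the tool above works internally. It swaps in `difflib.SequenceMatcher` as a simple similarity measure, and `KNOWN_LICENSES`, `normalize`, `best_match` and the 0.8 threshold are all assumptions made for the example.

```python
import difflib
import re

# Short excerpts stand in for full license texts to keep the sketch small.
KNOWN_LICENSES = {
    "MIT": (
        "Permission is hereby granted, free of charge, to any person "
        "obtaining a copy of this software"
    ),
    "Apache-2.0": (
        'Licensed under the Apache License, Version 2.0 (the "License"); '
        "you may not use this file except in compliance with the License"
    ),
}


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace."""
    return " ".join(re.findall(r"[a-z0-9.]+", text.lower()))


def best_match(license_text, known=KNOWN_LICENSES, threshold=0.8):
    """Return the closest known license id, or None below the threshold."""
    candidate = normalize(license_text)
    scored = [
        (difflib.SequenceMatcher(None, candidate, normalize(body)).ratio(), name)
        for name, body in known.items()
    ]
    score, name = max(scored)
    return name if score >= threshold else None


found = best_match(
    "Permission is hereby granted,\nfree of charge, to any person\n"
    "obtaining a copy of this software"
)
print(found)
```

Because the comparison runs entirely against local data, it needs no network access, but it still depends on the LICENSE file being present on disk.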
- Needs research
See Links / references section below for additional options.
git to fetch the latest version of the ruby-advisory-db.
This is nice because it allows for distribution of an offline version of the ruby-advisory-db.
If someone wants to update, they can but aren't forced to. This caters to an air gap environment because scanning of software can still take
place but the knowledge of vulnerabilities is only as good as the version of the ruby-advisory-db.
Fetching the latest advisories is preferred but not necessary.
Sharing the advisories as a git database makes it very easy to push/pull changes because there is lots of tooling and existing CDNs to distribute this information.
If the air gap environment allows for installation of packages via private registries, then we can look at tweaking each individual package manager class from within LicenseFinder. This is not guaranteed to work but it might allow us to gain a few wins as we get deeper understanding.
An alternative solution that I have been exploring is the idea around building an offline index for every known dependency name/version. This index could then be embedded in a docker image so that scanning would not need to reach out to the internet.
Can I detect the name and version of a dependency without relying on a package manager for that language tool?
.sln file to find a list of
.csproj file to find a list of dependencies. Sorry, I have touched
- I haven't looked at
Gemfile.lock can be parsed but has some unique edge cases like local path references, git, etc.
Pipfile.lock can be parsed easily. It's JSON. We can parse JSON.
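As a small demonstration of how little is needed for the Pipfile.lock case, the sketch below pulls pinned name/version pairs out of the `default` and `develop` sections. `parse_pipfile_lock` and the sample content are hypothetical. Entries without a `version` key (for example git or local path references) are skipped here rather than resolved.

```python
import json


def parse_pipfile_lock(text):
    """Return sorted (name, version) pairs for pinned packages."""
    lock = json.loads(text)
    found = set()
    for section in ("default", "develop"):
        for name, details in lock.get(section, {}).items():
            version = details.get("version")
            if version:
                # Pinned versions are stored as "==2.23.0".
                found.add((name, version.lstrip("=")))
    return sorted(found)


# Illustrative lock file content, not taken from a real project.
SAMPLE_LOCK = json.dumps(
    {
        "_meta": {"pipfile-spec": 6},
        "default": {"requests": {"version": "==2.23.0"}},
        "develop": {
            "pytest": {"version": "==5.4.1"},
            "local-tool": {"path": "./tools/local-tool"},
        },
    }
)

for name, version in parse_pipfile_lock(SAMPLE_LOCK):
    print(name, version)
```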
Can I determine the software license associated with a
name/version of a dependency?
- .NET: api.nuget.org provides these details.
- Java: Downloading a spec file from 'https://repo.maven.apache.org/maven2' is easy.
- Ruby: rubygems.org provides a JSON API for fetching dependency metadata.
- Python: pypi.org provides an API for fetching dependency metadata.
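For the Ruby and Python cases, a lookup by name/version might look like the sketch below. It uses the rubygems.org v2 versions endpoint and the pypi.org per-release JSON endpoint. The helper names are hypothetical, and the response handling assumes a `licenses` array for rubygems.org and an `info.license` string for pypi.org. `fetch_licenses` needs network access, so it is only useful when building an index outside the air gap.

```python
import json
import urllib.request

URL_TEMPLATES = {
    "rubygems": "https://rubygems.org/api/v2/rubygems/{name}/versions/{version}.json",
    "pypi": "https://pypi.org/pypi/{name}/{version}/json",
}


def metadata_url(ecosystem, name, version):
    """Build the metadata URL for one package version."""
    return URL_TEMPLATES[ecosystem].format(name=name, version=version)


def extract_licenses(ecosystem, payload):
    """Pull license identifiers out of an already decoded JSON response."""
    if ecosystem == "rubygems":
        return list(payload.get("licenses") or [])
    if ecosystem == "pypi":
        license_text = (payload.get("info") or {}).get("license")
        return [license_text] if license_text else []
    raise ValueError(f"unsupported ecosystem: {ecosystem}")


def fetch_licenses(ecosystem, name, version):
    """Network call: download the metadata and extract its licenses."""
    with urllib.request.urlopen(metadata_url(ecosystem, name, version)) as response:
        return extract_licenses(ecosystem, json.load(response))


print(metadata_url("rubygems", "rails", "6.0.2"))
print(extract_licenses("rubygems", {"licenses": ["MIT"]}))
```

Keeping URL construction and response parsing separate from the network call means the parsing logic can be exercised against saved responses inside the air gap.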
Can I pre-compute the license for every package/version for every package manager?
- .NET: Yes.
- Java: Not sure yet.
- Ruby: Almost. rubygems.org offers nightly backups of the pg database which makes it easy to pick up new packages/licenses.
- Python: Not sure yet.
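Once licenses are pre-computed, they have to be stored in a form that can be queried without a network. The sketch below shows one possible layout: CSV rows sharded into buckets by a hash of the package name, so a lookup reads a single small file. The bucket scheme, the `data` file name and the function names are assumptions for illustration and are not a description of any existing index format.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def bucket_for(name):
    """Pick a two-character bucket from a hash of the package name."""
    return hashlib.sha1(name.encode("utf-8")).hexdigest()[:2]


def build_index(root, records):
    """Write (name, version, license) records into per-bucket CSV files."""
    buckets = {}
    for record in records:
        buckets.setdefault(bucket_for(record[0]), []).append(record)
    for bucket, rows in buckets.items():
        directory = Path(root) / bucket
        directory.mkdir(parents=True, exist_ok=True)
        with open(directory / "data", "w", newline="") as handle:
            csv.writer(handle).writerows(sorted(rows))


def lookup(root, name, version):
    """Return the license for name/version, or None when not indexed."""
    data_file = Path(root) / bucket_for(name) / "data"
    if not data_file.exists():
        return None
    with open(data_file, newline="") as handle:
        for row_name, row_version, license_id in csv.reader(handle):
            if (row_name, row_version) == (name, version):
                return license_id
    return None


# Illustrative records only.
with tempfile.TemporaryDirectory() as root:
    build_index(root, [("rails", "6.0.2", "MIT"), ("nokogiri", "1.10.9", "MIT")])
    print(lookup(root, "rails", "6.0.2"))
    print(lookup(root, "rails", "0.0.0"))
```

A directory of plain text files like this can be distributed with git or baked into a docker image, which fits the air gap use case.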
    $ git clone https://gitlab.com/gitlab-org/gitlab.git
    $ gem install spandx -v 0.9.0
    $ spandx index update # pull latest offline index
    $ spandx scan gitlab/Gemfile.lock > no-airgap.json
    $ spandx scan gitlab/Gemfile.lock -a > airgap.json
    $ vimdiff no-airgap.json airgap.json
Or for a docker example:
    モ docker run --entrypoint='' -it mokhan/spandx:latest /bin/sh
    $ git clone https://gitlab.com/gitlab-org/gitlab.git
    $ spandx scan gitlab/Gemfile.lock > no-airgap.json
    $ spandx scan gitlab/Gemfile.lock -a > airgap.json
It's ugly, it's a crazy idea, it will need help, and there are probably more reasons to say no than yes.
It has one advantage: it is available now.
How big is this offline index?
I threw together some high-level numbers here.
So far the index holds 136,287 entries and occupies about 5 MB of disk space. It is not compressed and not optimized for machine consumption.
    モ du -sh ~/.local/share/mokhan/spandx-rubygems/lib/spandx/rubygems/index/
    5.0M /home/mokha/.local/share/mokhan/spandx-rubygems/lib/spandx/rubygems/index/
    5.0M total
    モ wc -l ~/.local/share/mokhan/spandx-rubygems/lib/spandx/rubygems/index/**/data
    136287 total
Availability & Testing
The efficacy of the offline catalogue needs improvement.
I know we have expertise with maintaining the
gemnasium-db so that knowledge could be utilized for
building the offline index.
What does success look like, and how can we measure that?
- License Compliance
- Code Status :ci_running:
- Proof of Concept for work around :ci_running:
- Documentation Status - :ci_pending:
- QA Status - :ci_pending:
- Demo Status - :ci_running: