Align codebase of all 3 gemnasium analyzers

Summary

There are important discrepancies in the codebase of the 3 Gemnasium analyzer projects. These discrepancies need to be removed so that the 3 projects can eventually share the same codebase and be merged into a single one.

Further details

  • gemnasium-maven and gemnasium-python depend on common/command, but gemnasium doesn't.
  • gemnasium uses Scanner.ScanDir but gemnasium-python uses ScanReader and gemnasium-maven uses ScanFile.
  • As a consequence, gemnasium is the only project that uses the scanner/finder package, and the other projects rely on custom detection logic.
  • gemnasium-maven and gemnasium-python both run CLI commands to generate a file that can be parsed, whereas gemnasium directly parses supported files.
  • gemnasium-maven and gemnasium-python look for supported files in a specific order, whereas gemnasium doesn't. In the case of gemnasium-python, default pip files win over setuptools files, and PIP_REQUIREMENTS_FILE wins over everything else.
  • gemnasium scans all the supported files, whereas gemnasium-maven and gemnasium-python stop after the first match. gemnasium-maven could technically process multiple files, but this would be a behavior change. gemnasium-python couldn't because of other limitations: a scanned project would leak its dependencies to the next ones.

Links

Proposal

  • add structs that represent the package managers as well as the files they handle
  • implement generic detection logic that leverages these structs and returns supported projects; it can be configured to maintain the behavior of gemnasium-maven and gemnasium-python
  • introduce "builders" as an abstraction for commands executed to export the dependencies to a file Gemnasium can parse (execution of the Gemnasium Maven plugin, pipenv graph, etc.)
  • align all CLIs so that they use the same detection logic, build the projects that need to be built (where applicable), and scan them
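
The first two points of the proposal could be sketched as follows. This is only an illustration: the type names (PackageManager, File, Project), the Detect function, and the example file names are assumptions, not the actual API of the analyzers. Declaration order expresses precedence, and the firstOnly flag is one possible way to keep the "stop after the first match" behavior of gemnasium-maven and gemnasium-python.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// File is a file handled by a package manager (hypothetical type).
type File struct {
	Name          string
	RequiresBuild bool // true if a builder must run before the file can be parsed
}

// PackageManager describes a package manager and the files it handles (hypothetical type).
type PackageManager struct {
	Name  string
	Files []File
}

// Project is a supported project found by the detection logic (hypothetical type).
type Project struct {
	Dir            string
	PackageManager string
	File           File
}

// exampleManagers is illustrative only; the order expresses precedence.
var exampleManagers = []PackageManager{
	{Name: "pip", Files: []File{{Name: "requirements.txt", RequiresBuild: true}}},
	{Name: "setuptools", Files: []File{{Name: "setup.py", RequiresBuild: true}}},
}

// Detect matches file paths against the given package managers.
// Managers and their files are checked in declaration order.
// When firstOnly is true, it returns right after the first match.
func Detect(paths []string, managers []PackageManager, firstOnly bool) []Project {
	var projects []Project
	for _, pm := range managers {
		for _, f := range pm.Files {
			for _, p := range paths {
				if filepath.Base(p) != f.Name {
					continue
				}
				projects = append(projects, Project{
					Dir:            filepath.Dir(p),
					PackageManager: pm.Name,
					File:           f,
				})
				if firstOnly {
					return projects
				}
			}
		}
	}
	return projects
}

func main() {
	paths := []string{"app/setup.py", "app/requirements.txt"}
	for _, p := range Detect(paths, exampleManagers, true) {
		fmt.Printf("%s: %s (%s)\n", p.Dir, p.File.Name, p.PackageManager)
	}
}
```

With firstOnly set to false, the same call returns every supported file, which corresponds to what gemnasium does today.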

See initial proposal

Share as much code as possible, and let gemnasium-python and gemnasium-maven implement specific plugins:

  • builder plugins, to build the project and generate files that can be parsed
  • parser plugins, to parse the lock file or dependency graph, and extract a list of packages
  • vrange plugins, to check if a version is included in a range

Each plugin system is implemented in the main gemnasium project, and this is where the plugin registry lives.

Each plugin lives with the project where it's used:

  • new builder plugins are introduced in gemnasium-maven and gemnasium-python
  • some existing vrange plugins are moved
  • some existing scanner/parser plugins are moved

For instance, vrange/python moves to gemnasium-python, and scanner/parser/mvnplugin moves to gemnasium-maven.
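
A minimal sketch of how such a registry could work, assuming plugins register themselves from an init function so that importing the plugin package is enough to enable it. The Parser interface, RegisterParser, ParserNames, and the file name used by the example plugin are all hypothetical. For brevity the registry and the plugin live in one file here; in the proposal the registry lives in gemnasium and the plugin in gemnasium-maven.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Parser is a hypothetical parser plugin interface.
type Parser interface {
	// Handles reports whether the parser supports the given file name.
	Handles(filename string) bool
}

var (
	parsersMu sync.RWMutex
	parsers   = map[string]Parser{}
)

// RegisterParser adds a parser plugin to the registry.
// It panics on duplicate names, since that indicates a programming error.
func RegisterParser(name string, p Parser) {
	parsersMu.Lock()
	defer parsersMu.Unlock()
	if _, dup := parsers[name]; dup {
		panic("parser registered twice: " + name)
	}
	parsers[name] = p
}

// ParserNames returns the names of the registered parsers, sorted.
func ParserNames() []string {
	parsersMu.RLock()
	defer parsersMu.RUnlock()
	names := make([]string, 0, len(parsers))
	for name := range parsers {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// mvnPluginParser stands in for a plugin that would live in gemnasium-maven.
type mvnPluginParser struct{}

// Handles uses a placeholder file name, not the real plugin output.
func (mvnPluginParser) Handles(filename string) bool {
	return filename == "maven-dependencies.json"
}

// init runs when the package is imported, even with a blank import.
func init() {
	RegisterParser("mvnplugin", mvnPluginParser{})
}

func main() {
	fmt.Println(ParserNames())
}
```

Builder and vrange plugins could follow the same pattern with their own interfaces and registries.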

The gemnasium project still builds the gemnasium Docker image, but it uses the same API as gemnasium-python and gemnasium-maven, for consistency. In the long term, shared code could be extracted into a separate repository.

The scanner package becomes generic, and only has two exported methods:

  • to walk a directory, and build a list of files to be processed
  • to process that list of files, using compatible builders and parsers

The walk function walks the given directory, and finds no more than one file per package type per directory. It queries the registered builders and parsers to figure out what files are supported. Parsers are queried first, so that ready-to-parse lock files win over dependency files (which require some kind of build).

The walk function is configurable, and it can optionally stop right after the first match. This way gemnasium-maven and gemnasium-python can behave like they currently do.

The process function replaces ScanDir, ScanFile, and ScanReader, which are no longer needed.

Ideally, the main.go of a Gemnasium project is as simple as:

  1. importing builder, parser, and vrange plugins as anonymous
  2. creating a NewApp using the gemnasium/cli package, and running this app

Implementation plan

Improvements

  • codebase is ready to be merged
  • codebase is more consistent, making contribution easier

Risks

None identified

Involved components

Optional: Intended side effects

Optional: Missing test coverage

None

Edited by Fabien Catteau