Engineering Discovery: reconsider Gemnasium client/server architecture

Problem to solve

Maintaining the current architecture of Gemnasium is costly, and without the necessary resources it also prevents us from improving the tool. On top of that, the features Gemnasium currently provides can possibly be achieved without that architecture.

Previous discussion about Gemnasium vision: https://docs.google.com/document/d/1pWP7pbSeoKwpcQ4lHAQ-wdpOhWuEVrNEl96Kq0ljYkY/edit#heading=h.vx2ew0wta8dm

Intended users

group::composition analysis

Further details

Current architecture presents some limitations:

we cannot provide security advisories for packages that are published to private registries or alternative public registries we don't support yet.
- private RubyGems or npm server
- public Maven registry other than Maven Central (example)
we cannot provide security advisories for removed packages (e.g. a yanked gem)
we cannot have security advisories for yanked versions (like a gem containing a backdoor)

Current architecture has a maintenance cost:

to maintain the advisories that get new affected or published versions after first being published
- affected ranges have to be re-evaluated to apply to new affected versions
- fixed versions cannot be set in the advisory if the package version is not yet available
to monitor, secure, and update the GCP installation (Google Cloud Platform)
to upgrade the package syncers used to synchronize package metadata with the various package registries currently supported, like rubygems.org or pypi.org; they change over time and we've got to keep up with these changes
to grant access to admins, keep track of the permissions, revoke access when needed

On top of resolving the issues above, directly leveraging the gemnasium-db repository would provide some other benefits:

a simpler approach for air-gapped configurations: there is no need to implement a sync mechanism with the Gemnasium server, since the Git repository already provides the necessary tools
gemnasium-db can be forked, allowing customers to maintain their own security advisories on top of what's available in gemnasium-db
a simpler workflow, since merging an MR in gemnasium-db has the immediate effect of publishing the advisory; there's no extra step, and there are no possible synchronization issues between gemnasium-db and the Gemnasium relational DB

Also, directly leveraging the gemnasium-db repository opens new possibilities by making some features a lot cheaper to implement:

add new fields to the vulnerability, like the severity
support vendor package registries - this is a common need for PHP and Java
support new package managers and languages, like Go

Important: the goal here is not to throw away the server-side code, but to bypass it until we re-integrate or implement new features that need it.

Proposal

Consider modifying the Gemnasium client to use the YAML files of the gemnasium-db repository instead of the API served on https://deps.sec.gitlab.com.
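
To illustrate the idea, here is a minimal sketch of client-side matching once the advisory files have been parsed into dictionaries. It is not the analyzer's implementation: the field names (`package`, `identifier`, `affected_range`), the package names, the identifiers, and the simplified range syntax are all assumptions made for this example, and real version ranges differ per package manager.

```python
# Hypothetical advisory records, as they might look after the YAML files
# of the advisory repository have been parsed. Field names are assumptions.
ADVISORIES = [
    {"package": "gem/example-lib", "identifier": "EXAMPLE-0001",
     "affected_range": "<1.2.3"},
    {"package": "npm/example-pkg", "identifier": "EXAMPLE-0002",
     "affected_range": ">=2.0.0 <2.4.1"},
]


def parse_version(text):
    """Turn '1.2.3' into (1, 2, 3); only plain dotted integers are handled."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, constraint):
    """Check a single constraint such as '<1.2.3' or '>=2.0.0'."""
    # Two-character operators are tested first so '<=' is not read as '<'.
    for op in ("<=", ">=", "<", ">", "="):
        if constraint.startswith(op):
            v = parse_version(version)
            bound = parse_version(constraint[len(op):])
            return {"<=": v <= bound, ">=": v >= bound,
                    "<": v < bound, ">": v > bound, "=": v == bound}[op]
    raise ValueError(f"unsupported constraint: {constraint!r}")


def is_affected(version, affected_range):
    """All space-separated constraints must hold (logical AND)."""
    return all(satisfies(version, c) for c in affected_range.split())


def scan(dependencies, advisories):
    """Return (package, version, identifier) for every matching advisory."""
    findings = []
    for package, version in dependencies:
        for adv in advisories:
            if adv["package"] == package and is_affected(version, adv["affected_range"]):
                findings.append((package, version, adv["identifier"]))
    return findings
```

Because the range is evaluated at scan time against the version found in the project, the check does not depend on a server knowing which versions of a package have been published.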

Make sure existing features and short/mid-term ones can all be achieved without the server side; this could be verified by prototyping a fork of analyzers/gemnasium that uses gemnasium-db instead of connecting to the Gemnasium API.
- supports Dependency Scanning
- supports Dependency List
- supports Auto-Remediate (yarn)
Evaluate the cost of such a transition. See https://gitlab.com/gitlab-org/gitlab-ee/issues/14630
- fits in one iteration
Consider de-provisioning the Gemnasium micro-services infrastructure. This implies making sure that all versions of the analyzer migrate to the new architecture (including on self-managed instances using an old version of GitLab). See https://gitlab.com/gitlab-org/gitlab-ee/issues/14692
Find out what mid/long-term features would benefit from the Gemnasium Server, and how Gemnasium Server could fit in a new architecture where gemnasium-db is the single source of truth (SST) for the security advisories.
Document the migration path and back-port strategy

What did Gemnasium Server use to provide?

The Gemnasium Server (API, postgresql DB, services) used to run Gemnasium.com.

Gemnasium.com had two important features that are currently missing in GitLab:

package metadata; in the project dependency list, Gemnasium.com displayed detailed information about the package and package version
release notification; Gemnasium.com notified users about new versions of the packages their projects depend on

Technically, these features relied on micro-services responsible for tracking the package registries (like rubygems.org) and automatically fetching package metadata. Gemnasium Server relies on a messaging system (NSQ) to react to a new package or version being published, fetch its metadata, and notify users.

Why are these features missing?

Though package metadata and release notification could exist in GitLab Dependency Scanning, they have to be re-implemented and even re-designed. That's because the technical context has dramatically changed; the architecture of GitLab Dependency Scanning is very different from that of Gemnasium.com. To illustrate, the Gemnasium DB no longer has data related to the users and their projects; this data now belongs to the GitLab DB, and we can't simply JOIN user projects (in GitLab DB) with updated packages (in Gemnasium RDB).

Can Gemnasium Server complement gemnasium-db?

Yes. There could be a separation of concerns:

Gemnasium Server collects package metadata
gemnasium-db is the vulnerability database

We may need to make the two communicate to implement some advanced features, like vulnerability detection in new releases: the Gemnasium Server would detect that a new package version is a vulnerability fix, and it would publish an MR in gemnasium-db, either directly or via an intermediary tool.

What about package metadata?

GitLab has a Dependency List but it shows no metadata for the packages and their versions. The missing data is exposed by the Gemnasium API but the Gemnasium analyzer/client doesn't fetch it (and these API endpoints are not public right now). But making the Gemnasium analyzer fetch the data seems like a bad idea because it would result in huge Dependency Scanning reports. A better approach would be to make the Rails backend merge this data when building the Dependency List. The backend would get the data from the Gemnasium Server - to be designed.
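
As a rough, language-neutral sketch of the backend-side merge (the actual backend is Rails, and the interface with the Gemnasium Server is still to be designed), the function below attaches metadata to each dependency while the list is being built. The `fetch_metadata` callable and the record fields are hypothetical.

```python
def build_dependency_list(dependencies, fetch_metadata):
    """Attach package metadata to each dependency of a report.

    `fetch_metadata(package, version)` is a hypothetical lookup that would be
    backed by the Gemnasium Server; it returns a dict, or None when unknown.
    """
    enriched = []
    for dep in dependencies:
        meta = fetch_metadata(dep["package"], dep["version"]) or {}
        # The report entry itself stays small; metadata is added only here.
        enriched.append({**dep, "metadata": meta})
    return enriched
```

The scanning report keeps only package names and versions, so its size is unaffected by the amount of metadata available.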

What about user notifications?

GitLab can already notify its users, and its DB contains information about the users, projects, and dependencies. The Rails backend needs to be notified by the Gemnasium Server about new package versions, so that it can react and ultimately notify the users - this would be similar to what used to happen within the Gemnasium Server itself, but adapted to a new technical environment.
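
A minimal sketch of the reaction on the GitLab side, assuming the Gemnasium Server sends an event naming the package and its new version. The event shape and the in-memory mapping are hypothetical stand-ins for a real payload and for a query against the GitLab DB.

```python
def projects_to_notify(event, project_dependencies):
    """Return the sorted names of projects depending on the updated package.

    `event` is a hypothetical payload such as
    {"package": "gem/example-lib", "version": "1.3.0"}.
    `project_dependencies` maps a project name to the set of packages it uses.
    """
    return sorted(
        project
        for project, packages in project_dependencies.items()
        if event["package"] in packages
    )
```

The lookup that Gemnasium.com performed inside its own database becomes a lookup on the GitLab side, triggered by the incoming event.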

Are we losing the resources we've put into Gemnasium Server?

No. We're changing the purpose of Gemnasium Server: its task is to grab package metadata, and to notify GitLab about new package versions. But it doesn't host the vulnerability DB anymore, and probably won't have to interact with it in the short/medium-term.

Don't we need an RDBMS for performance reasons?

bundler-audit only relies on YAML files, and it's proven to work.

Also, we can later optimize the Gemnasium Analyzer using an RDBMS embedded in the Docker image. For instance, we could create and populate a SQLite DB when building the image, and the Gemnasium Analyzer would be able to query this DB at run-time.
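
The build-then-query idea could look like the sketch below, using Python's standard `sqlite3` module. The table layout, column names, and function names are assumptions made for illustration; nothing here reflects an existing schema.

```python
import sqlite3


def build_db(path, advisories):
    """Build step: create the table and load advisories (run at image build)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS advisories ("
        "identifier TEXT PRIMARY KEY, "
        "package TEXT NOT NULL, "
        "affected_range TEXT NOT NULL)"
    )
    # Index the column the analyzer filters on at run-time.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_package ON advisories(package)")
    conn.executemany(
        "INSERT OR REPLACE INTO advisories "
        "VALUES (:identifier, :package, :affected_range)",
        advisories,
    )
    conn.commit()
    return conn


def advisories_for(conn, package):
    """Run-time step: fetch the advisories recorded for one package."""
    rows = conn.execute(
        "SELECT identifier, affected_range FROM advisories "
        "WHERE package = ? ORDER BY identifier",
        (package,),
    )
    return rows.fetchall()
```

In an image build, `build_db` would be given a file path so the database ships inside the image; passing `":memory:"` is enough to try the sketch locally.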

What's the migration path? How to back-port?

Since gemnasium-db is becoming the SST, all supported versions of GitLab must upgrade to a version of Gemnasium that uses gemnasium-db, otherwise they wouldn't get the latest advisories. Ultimately versions of GitLab using the Gemnasium Server will stop functioning when the Gemnasium GCP installation is removed.

GitLab %11.7 and newer use Dependency Scanning 2 (latest version of 2.x branch), which uses Gemnasium 2 (latest version of 2.x branch), so the change will be effective as soon as a new version of Gemnasium 2 is published. There's no need to back-port anything.

GitLab %11.6 uses Dependency Scanning 1 (latest version of 1.x branch), which uses Gemnasium 1. A back-port is needed but the cost seems reasonable.

GitLab %11.5 and older use versions of DS that pre-date the Docker-based architecture. There, Gemnasium exists in the form of a Ruby script called gemnasium.rb. In this context, the cost of the back-port seems prohibitive.

Links / references

Outcome

action items / results should be populated in https://gitlab.com/gitlab-org/gitlab-ee/issues/14630 and https://gitlab.com/gitlab-org/gitlab-ee/issues/14692

Edited Sep 06, 2019 by Fabien Catteau