Engineering discovery: Data that needs to be stored for dependency information on the way to a database

Problem

Today it is difficult, if not impossible, for us to add some features because of how we store data about packages, project dependencies, and related information, and how much of it we store.

Note

Discovered in a call: Nicole counts object storage as a database, so the proposal below should state which data storage or database we want to use.

Ask

I would like us to think about what data is needed in the longer term to deliver the features customers are asking for. Then pick one piece of that data and, keeping the longer-term goals in mind, design and build a POC that adds some of the data needed for one or more of the customer features listed below. (Also feel free to add features!)

PS - you may want to look at the epic - &5199 (closed)

Outcome

A short-term proposal for additional data to be stored, and a longer-term list of the data we would need in order to accomplish all the features we would currently like to implement.

We should review the design with Database Engineering and ~"Category:Vulnerability Database" to validate both the short- and long-term goals as we know and understand them today. We may also find other teams that share some of our short- or long-term data goals; we should note them so that we can coordinate on those items where possible.

In the end we will implement only a small subset of the tables and columns needed. The intent is that, by considering the future now, we reduce the number of migrations and other painful steps as we enhance the database to add all the features over time.

Customer Features

Upcoming customer features I want soonest. The order will be decided by this proof of concept/plan.

  • SPDX/SBOM
  • alerting users if the location of a dependency changes (dependency confusion)
  • looking at pulling in Package Hunter
  • alerting on a new CVE for an existing dependency
  • grouping findings that come from the same introduced dependency

Information about project dependencies

Tracking the source of a package

  • (new) where it usually comes from (to spot dependency confusion or a change of source)
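
One way to spot a change of location is to remember, per package, the sources it has been resolved from and to flag a resolution from a source never seen before. The sketch below is only an illustration under assumptions: `SourceHistory`, `record`, and `usual_source` are hypothetical names, and a "source" is assumed to be a registry URL string.

```python
from collections import Counter, defaultdict


class SourceHistory:
    """Remembers which sources each package was resolved from (hypothetical helper)."""

    def __init__(self):
        # package name -> Counter mapping source URL -> times seen
        self._seen = defaultdict(Counter)

    def usual_source(self, package):
        """Return the most frequently seen source, or None for an unknown package."""
        counts = self._seen[package]
        return counts.most_common(1)[0][0] if counts else None

    def record(self, package, source):
        """Record a resolution; return True if it should raise an alert.

        An alert is raised when the package already has a history and this
        source has never been seen for it before.
        """
        counts = self._seen[package]
        is_change = bool(counts) and source not in counts
        counts[source] += 1
        return is_change
```

The first resolution of a package never alerts, since there is nothing to compare against; whether that is acceptable is an open design question.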

Explaining how a dependency relates to the project itself

  • Show (one of the) dependency paths (for looking at a dependency or a vulnerability and asking "where did this come from?")
  • Show (many/all) dependency paths? (is this valuable?)
  • Show dependency graph
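
If direct dependency edges are stored, showing one dependency path is a graph search. A minimal sketch, assuming the graph is available as a mapping from each package to its direct dependencies (the function name and the input shape are illustrative, not an existing API):

```python
from collections import deque


def find_dependency_path(graph, root, target):
    """Return one shortest path from root to target, or None if unreachable.

    graph maps a package name to the list of packages it depends on directly.
    """
    parents = {root: None}  # doubles as the visited set, so cycles are safe
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the root, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for child in graph.get(node, []):
            if child not in parents:
                parents[child] = node
                queue.append(child)
    return None
```

Breadth-first search returns a shortest path, which is probably the most readable answer to "where did this come from?". Listing many or all paths can grow very quickly on large graphs, which is worth weighing when deciding whether that feature is valuable.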

Querying dependencies

  • As a customer I want to know what dependencies (and versions) are in each of my projects, and their licenses (list and search)
  • As a customer I want to know what dependencies (and versions) are in each of my GROUPS of projects, and their licenses (list and search)
  • As a customer I want to know what dependencies (and versions) are in each of my INSTANCES, and their licenses (list and search)
  • As a customer I may wish to associate versions of my dependency list with versions of my project, so that if one of my users asks for the license list of a specific (older) version, I can provide it.
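
The list-and-search stories above can be pictured as one query with optional project, group, and search filters. The sketch below uses an in-memory SQLite database purely for illustration; the flat `dependency_occurrences` table, its columns, and the sample rows are assumptions, not a proposed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dependency_occurrences (
        group_path   TEXT NOT NULL,
        project_path TEXT NOT NULL,
        package      TEXT NOT NULL,
        version      TEXT NOT NULL,
        license      TEXT
    )
    """
)
rows = [  # made-up sample data
    ("acme", "acme/api", "requests", "2.31.0", "Apache-2.0"),
    ("acme", "acme/api", "pyyaml", "6.0.1", "MIT"),
    ("acme", "acme/web", "requests", "2.31.0", "Apache-2.0"),
    ("other", "other/tool", "pyyaml", "5.4.1", "MIT"),
]
conn.executemany("INSERT INTO dependency_occurrences VALUES (?, ?, ?, ?, ?)", rows)


def list_dependencies(conn, group_path=None, project_path=None, search=None):
    """List distinct (package, version, license) rows.

    With no filters this is the instance-level view; group_path and
    project_path narrow the scope; search does a substring match on the name.
    """
    clauses, params = [], []
    if group_path is not None:
        clauses.append("group_path = ?")
        params.append(group_path)
    if project_path is not None:
        clauses.append("project_path = ?")
        params.append(project_path)
    if search is not None:
        clauses.append("package LIKE ?")
        params.append(f"%{search}%")
    where = f"WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = (
        "SELECT DISTINCT package, version, license "
        f"FROM dependency_occurrences {where} ORDER BY package, version"
    )
    return conn.execute(sql, params).fetchall()
```

For example, `list_dependencies(conn, group_path="acme")` lists each package once even though `requests` appears in two projects of that group.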

Tracking changes in dependencies

  • As a customer I may wish to see dependencies merged into "main" versus those in "feature branches"
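
If a dependency list is stored per branch, comparing a feature branch with "main" is a dictionary diff. A minimal sketch, assuming each list is a mapping from package name to version (the function name and the input shape are illustrative):

```python
def diff_dependencies(base, branch):
    """Compare two {package: version} mappings.

    Returns the packages added on the branch, removed on the branch, and
    those whose version changed, as (base_version, branch_version) pairs.
    """
    added = {p: v for p, v in branch.items() if p not in base}
    removed = {p: v for p, v in base.items() if p not in branch}
    changed = {
        p: (base[p], v)
        for p, v in branch.items()
        if p in base and base[p] != v
    }
    return {"added": added, "removed": removed, "changed": changed}
```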

Exporting an SBOM

  • Being able to easily generate project- or workspace-level exports of all dependencies (SBOM)
  • We'll want these to be associated with licenses as much as possible (for SBOM)
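
To make the export concrete, the sketch below builds a document shaped like SPDX 2.3 JSON from stored dependency rows. It is an illustration only: the output has not been validated against the SPDX specification, the input row shape (`name`, `version`, `license`, `source`) is an assumption, and `build_sbom` is a hypothetical name. Missing licenses or sources are written as `NOASSERTION`.

```python
import json


def build_sbom(name, namespace, created, dependencies):
    """Build an SPDX-2.3-style document (unvalidated sketch) as a dict.

    created is an ISO 8601 timestamp string supplied by the caller;
    dependencies is a list of dicts with "name" and "version" and
    optional "license" and "source" keys.
    """
    packages = []
    for index, dep in enumerate(dependencies, start=1):
        packages.append(
            {
                "SPDXID": f"SPDXRef-Package-{index}",
                "name": dep["name"],
                "versionInfo": dep["version"],
                "downloadLocation": dep.get("source") or "NOASSERTION",
                "licenseConcluded": dep.get("license") or "NOASSERTION",
            }
        )
    return {
        "spdxVersion": "SPDX-2.3",
        "dataLicense": "CC0-1.0",
        "SPDXID": "SPDXRef-DOCUMENT",
        "name": name,
        "documentNamespace": namespace,
        "creationInfo": {
            "created": created,
            "creators": ["Tool: example-sbom-exporter"],  # placeholder tool name
        },
        "packages": packages,
    }


def export_sbom(*args, **kwargs):
    """Serialize the document as indented JSON text."""
    return json.dumps(build_sbom(*args, **kwargs), indent=2)
```

The sketch shows why license and source need to be stored alongside each dependency: without them the export can only say `NOASSERTION`.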

Notify about new vulnerabilities affecting project dependencies

(previously was "the relationship of project dependencies to vulnerabilities")

  • Alert/inform customers when existing dependencies have newly identified findings (i.e. compare the Dependency List against newly added entries in the vuln DB), ideally without needing a pipeline (maybe just a cron job or something similar). We may need to assume the vuln DB is also a real DB for this feature.
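
The comparison described above could run as a scheduled job over stored data, with no pipeline involved. A minimal sketch under strong assumptions: versions are plain dotted integers, an advisory affects every version below its `fixed_in` value, and the record shapes and names are hypothetical. Real ecosystems have richer version schemes and affected ranges.

```python
def parse_version(text):
    """Turn "1.2.10" into (1, 2, 10); only handles dotted integers."""
    return tuple(int(part) for part in text.split("."))


def affected_projects(stored_dependencies, new_advisories):
    """Match newly added advisories against the stored dependency list.

    Returns (project, package, version, advisory_id) tuples for every stored
    dependency whose version is lower than the advisory's fixed version.
    """
    hits = []
    for advisory in new_advisories:
        fixed = parse_version(advisory["fixed_in"])
        for dep in stored_dependencies:
            if (
                dep["package"] == advisory["package"]
                and parse_version(dep["version"]) < fixed
            ):
                hits.append(
                    (dep["project"], dep["package"], dep["version"], advisory["id"])
                )
    return hits
```

The nested loop is fine for a sketch; at real scale the same match would more likely be an indexed join on package name, which is one reason to assume the vuln DB is a real database.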

Notify about important changes affecting project dependencies

(previously was "package data (info, risk, etc)")

  • We want to be able to inform users of new risks (EOL) and opportunities (a newly released minor or major version), again hopefully without needing a pipeline
  • We want to inform users of other noteworthy changes related to their dependencies (e.g. a changed maintainer)
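
Telling users whether a newly released version is a minor or a major update only needs the stored version and the latest known version. A minimal sketch, assuming strict three-part numeric versions (MAJOR.MINOR.PATCH); the function name is illustrative and other version schemes would need their own handling:

```python
def classify_update(current, latest):
    """Return "major", "minor", or "patch" if latest is newer, else None."""
    cur = [int(part) for part in current.split(".")]
    new = [int(part) for part in latest.split(".")]
    if len(cur) != 3 or len(new) != 3:
        raise ValueError("expected MAJOR.MINOR.PATCH versions")
    if new <= cur:
        return None  # same version or older: nothing to announce
    # latest is newer, so the first differing component names the update type.
    for label, old_part, new_part in zip(("major", "minor", "patch"), cur, new):
        if new_part != old_part:
            return label
```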

Policies

  • We want to be able to query, at any time, the dependencies that are out of compliance and whether or not they were approved as exceptions (for example, because they were added before the policies were put in place). For instance: these have critical vulnerabilities and we don't allow that; these have denied licenses. The list should say which dependency, version, and project(s) are involved, and who made the exception and when (referencing an MR approval or an MR, whichever has an audit log?)
  • Customers should be able to add lists of denied and allowed dependencies (by specific version, all versions, or versions equal to or earlier than X). These lists would trigger MR approvals and feed the exception reports above. The allow/deny list should be shareable with Category:Dependency Firewall and Category:Dependency Proxy, and applicable at the project, group, tag, or instance level. In other words, customers create lists, then have a mechanism to apply a policy list that can change as needed, referencing the list as a whole set. It should also be possible to exempt a single occurrence
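
The three kinds of version rule and the single-occurrence exemption can be sketched as follows. Everything here is an assumption for illustration: the `Rule` class, the rule kind names, the dotted-integer version format, and the representation of exceptions as (project, package, version) tuples. Allow lists and scoping by group, tag, or instance are left out.

```python
from dataclasses import dataclass


def parse_version(text):
    """Turn "2.14.1" into (2, 14, 1); only handles dotted integers."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Rule:
    """One entry of a deny list (hypothetical shape)."""

    package: str
    kind: str  # "exact", "all", or "up_to" (equal to or earlier than version)
    version: str = ""

    def matches(self, package, version):
        if package != self.package:
            return False
        if self.kind == "all":
            return True
        if self.kind == "exact":
            return version == self.version
        if self.kind == "up_to":
            return parse_version(version) <= parse_version(self.version)
        raise ValueError(f"unknown rule kind: {self.kind}")


def evaluate(package, version, project, deny_rules, exceptions):
    """Return "allowed", "denied", or "exempted" for one dependency occurrence.

    exceptions is a set of (project, package, version) tuples, so an
    exemption covers a single occurrence rather than the whole rule.
    """
    if not any(rule.matches(package, version) for rule in deny_rules):
        return "allowed"
    if (project, package, version) in exceptions:
        return "exempted"
    return "denied"
```

Because the rules live in a list separate from where they are applied, the same list could be referenced as a whole by several projects or groups; an exception report would then be the set of occurrences that evaluate to "exempted", joined with who approved them and when.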

Considerations

  • must work self-hosted and on SaaS
  • must work offline
  • space should be considered: what is the minimal data we can store and still accomplish our goals?
  • must be permissioned properly so that no one can edit, add, remove, or view data unless that is an expected or known condition. We may wish to work with the permissions team, as permissions are not yet RBAC/granular, and we may wish to implement separate permissioning on top
  • needs to have reasonable call limits to prevent abuse
  • needs to be performant (this may come after the spike, but we may need indexing, caching, or other measures)
  • we will need to report on the used size for the charge-out / limits projects
Edited by Olivier Gonzalez