Support advisories and affected packages data sync protocol

Why are we doing this work

A new version format is needed for advisory ingestion. The monolith sync service needs to be able to use this format.

Background

The external license database exports a set of deltas representing its internal dataset over time. Each delta is written to a GCP bucket as a set of files under a particular timestamp, and that timestamp is the identifier for the delta. The data for a delta is written as a set of chunks, each with an upper size limit.

As an example:

If data coming into the external license db looks like the following:

data at t1
- rails,[6.1,6.2],MIT
data at t2
- rails,[6.3],MIT

Then the exporter writes the following to the GCP bucket:

at t1 v1/rubygem/t1/file.csv
- contents of csv are
  - rails,6.1,MIT
  - rails,6.2,MIT
at t2 v1/rubygem/t2/file.csv
- contents of csv are
  - rails,6.3,MIT

This format allows both the producer and consumers to be stateless (aside from storing the last synced timestamp).
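The expansion in the example above can be sketched in Ruby roughly as follows. This is only an illustration: the Record struct and the export_rows and delta_path helpers are hypothetical names, not part of the exporter, and the chunk name "file" is taken from the example paths.

```ruby
# Hypothetical record shape: one package, a list of versions, one license.
Record = Struct.new(:name, :versions, :license)

# Expand each record into one row per version, as in the example above.
def export_rows(records)
  records.flat_map do |record|
    record.versions.map { |version| [record.name, version, record.license] }
  end
end

# Build the v1 object path for a delta written at the given timestamp.
def delta_path(purl_type, timestamp, chunk = 'file')
  "v1/#{purl_type}/#{timestamp}/#{chunk}.csv"
end

rows = export_rows([Record.new('rails', %w[6.1 6.2], 'MIT')])
csv_lines = rows.map { |row| row.join(',') }
# csv_lines is ["rails,6.1,MIT", "rails,6.2,MIT"]
```

Because each delta is self-contained under its timestamp, a consumer only has to remember the last timestamp it processed.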

Monolith Sync

The monolith uses checkpoints to store the last synced position. If a checkpoint exists (sequence and chunk match), only the files after this checkpoint are fetched.
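A minimal sketch of checkpoint-based filtering, assuming files can be ordered by (sequence, chunk). The Checkpoint and RemoteFile structs and the files_after helper are hypothetical and do not mirror the actual connector code, which may locate the checkpoint differently.

```ruby
# Hypothetical shapes: sequence is the delta timestamp, chunk is the file
# position within that delta.
Checkpoint = Struct.new(:sequence, :chunk)
RemoteFile = Struct.new(:sequence, :chunk)

# Return the files strictly after the checkpoint, ordered by (sequence, chunk).
# With no checkpoint, every file is returned.
def files_after(files, checkpoint)
  sorted = files.sort_by { |f| [f.sequence, f.chunk] }
  return sorted if checkpoint.nil?

  position = [checkpoint.sequence, checkpoint.chunk]
  sorted.select { |f| ([f.sequence, f.chunk] <=> position) == 1 }
end

files = [RemoteFile.new(2, 0), RemoteFile.new(1, 0), RemoteFile.new(1, 1)]
pending = files_after(files, Checkpoint.new(1, 0))
# pending holds the files at (1, 1) and (2, 0), in that order
```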

The connectors instantiate a CsvFile which is a simple enumerable container responsible for offering a lazy enum interface and parsing the csv data into a DataObject.
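To illustrate the idea of a lazy enumerable container, here is a simplified stand-in. LazyCsvFile and LicenseRow are hypothetical names rather than the real CsvFile and DataObject, and the naive comma split assumes fields contain no quoted commas.

```ruby
require 'stringio'

# Hypothetical data object with the three license columns from the example.
LicenseRow = Struct.new(:name, :version, :license)

class LazyCsvFile
  include Enumerable

  def initialize(io)
    @io = io
  end

  # Yield one parsed object per line without loading the whole file.
  def each
    return enum_for(:each) unless block_given?

    @io.each_line do |line|
      fields = line.chomp.split(',')
      next unless fields.size == 3 # skip malformed rows

      yield LicenseRow.new(*fields)
    end
  end
end

file = LazyCsvFile.new(StringIO.new("rails,6.1,MIT\nbad\nrails,6.2,MIT\n"))
versions = file.map(&:version)
# versions is ["6.1", "6.2"]
```

Including Enumerable means callers can use map, each_slice, or lazy without the container knowing how the rows will be consumed.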

After ingestion has fully finished, the new checkpoint is saved: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L50

Changes

The identifier for this new format is v2, and it is part of the path used to locate file chunks. The following are changed:

URL
Storage format
Object format

1. URL changes

A data_type segment is added to the URL, which changes from v1/<purl_type>/<timestamp>/<chunk>.csv to v2/<purl_type>/[advisories|licenses]/<timestamp>/<chunk>.ndjson.
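The two path layouts could be built along these lines. The chunk_path helper and DATA_TYPES constant are hypothetical names used only for illustration; only the path templates come from the description above.

```ruby
DATA_TYPES = %w[advisories licenses].freeze

# Build the chunk path for either version format.
# data_type is only meaningful for v2 and must be one of DATA_TYPES.
def chunk_path(version_format:, purl_type:, timestamp:, chunk:, data_type: nil)
  case version_format
  when 'v1'
    "v1/#{purl_type}/#{timestamp}/#{chunk}.csv"
  when 'v2'
    unless DATA_TYPES.include?(data_type)
      raise ArgumentError, "unknown data_type: #{data_type.inspect}"
    end

    "v2/#{purl_type}/#{data_type}/#{timestamp}/#{chunk}.ndjson"
  else
    raise ArgumentError, "unknown version_format: #{version_format.inspect}"
  end
end

v2_path = chunk_path(version_format: 'v2', purl_type: 'deb',
                     data_type: 'advisories', timestamp: 1688515200, chunk: 0)
# v2_path is "v2/deb/advisories/1688515200/0.ndjson"
```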

2. Storage format

The storage format has been changed from CSV to NDJSON (newline-delimited JSON, one object per line).
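Reading such a file one line at a time might look like the sketch below. The each_ndjson_object helper is a hypothetical name, and skipping blank lines is an assumption rather than something the format requires.

```ruby
require 'json'
require 'stringio'

# Yield one parsed JSON object per non-empty line of the given IO.
# A line that is not valid JSON raises JSON::ParserError.
def each_ndjson_object(io)
  return enum_for(:each_ndjson_object, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

io = StringIO.new(%({"a":1}\n\n{"a":2}\n))
objects = each_ndjson_object(io).to_a
# objects is [{"a"=>1}, {"a"=>2}]
```

Since every line is an independent document, a chunk can be streamed without holding the whole file in memory.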

3. Object format

Each object is a JSON document with the following fields:

id - unique identifier for the advisory
database - indicating which database this advisory came from
advisory - stores contents of the advisory data
packages - stores the packages affected by this advisory and ranges affected

The fields for advisory and packages are specified in PackageMetadata::Advisory and PackageMetadata::AffectedPackage.

Example:

{
  "advisory": {
      "id": "CVE-2022-40303",
      "database": "trivy-db",
      "title": "",
      "description": "...",
      "cvss_v3": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H",
      ...
  },
  "packages": [
     {
       "name": "libxml2",
       "purl_type": "deb",
       "dist_version": "10",
       "affected_range": "<2.9.4+dfsg1-7+deb10u5",
       "severity": "..."
     },
     {
       // ...
     }
  ]
}
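As a rough sketch of how one such line might be turned into objects, the code below follows the nesting used in the example. The Advisory and AffectedPackage structs here are plain stand-ins with only the fields visible in the example, not the PackageMetadata models, and build_advisory is a hypothetical helper.

```ruby
require 'json'

# Stand-in value objects limited to the fields shown in the example.
Advisory = Struct.new(:id, :database, :title, :description, :cvss_v3,
                      keyword_init: true)
AffectedPackage = Struct.new(:name, :purl_type, :dist_version,
                             :affected_range, :severity, keyword_init: true)

# Copy only the known members out of a string-keyed hash.
def pick(struct_class, hash)
  struct_class.new(**struct_class.members.to_h { |m| [m, hash[m.to_s]] })
end

# Parse one NDJSON line into an advisory and its affected packages.
def build_advisory(json_line)
  data = JSON.parse(json_line)
  advisory = pick(Advisory, data.fetch('advisory'))
  packages = data.fetch('packages', []).map { |pkg| pick(AffectedPackage, pkg) }
  [advisory, packages]
end

line = JSON.generate(
  'advisory' => { 'id' => 'CVE-2022-40303', 'database' => 'trivy-db' },
  'packages' => [{ 'name' => 'libxml2', 'purl_type' => 'deb',
                   'dist_version' => '10',
                   'affected_range' => '<2.9.4+dfsg1-7+deb10u5' }]
)
advisory, packages = build_advisory(line)
```

Unknown keys are ignored and missing ones become nil, which keeps the sketch tolerant of fields that the example elides.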

Relevant links

version format discussion #370780 (closed)
research spike #394723 (closed)

Non-functional requirements

Documentation: n/a
Feature flag: n/a
Performance: n/a
Testing: n/a

Implementation plan

add sync config for advisories
- add advisory-specific data (bucket, offline location, etc.)
add data objects
update data object fabrication

Below is the old implementation plan, which was superseded by the plan above after most of the needed functionality was added in Refactor interface between sync protocol and da... (!120795 - merged)

Old implementation plan

Update checkpoint

create migration to add version_format and data_type to checkpoints

Update connectors (work ongoing in Refactor interface between sync protocol and da... (!120795 - merged))

extract common CsvFile functionality out of offline and gcp connectors and change this class to DataFile
update both connectors to accept data_type and select the correct url/path based on it
update connector iterators to instantiate a DataFile with data_type (e.g. gcp)
update DataFile to accept a data_type parameter so as to determine file suffix (e.g. for gcp)
- offline archive_path
- gcp file_prefix

Update data parsing

rename PackageMetadata::DataObject to PackageMetadata::LicenseDataObject
add new object PackageMetadata::AdvisoryDataObject with fields to populate PackageMetadata::Advisory and PackageMetadata::AffectedPackage (similar to PackageMetadata::LicenseDataObject)
rename .from_csv to .parse https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/data_object.rb#L12
update .parse to support json as well as csv, based on the data_type supplied by the connector https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/data_object.rb#L12

Edited Jul 05, 2023 by Igor Frenkel