Support advisories and affected packages data sync protocol

Why are we doing this work

A new version format is needed for advisory ingestion. The monolith sync service needs to be able to use this format.

Background

The external license database exports a set of deltas representing changes to its internal dataset over time. Each delta is written to a GCP bucket as a set of files at a particular timestamp, and that timestamp is the identifier for the delta. The data for a delta is written as a set of chunks, each with an upper size limit.

As an example:

If data coming into the external license db looks like the following:

  • data at t1
    • rails,[6.1,6.2],MIT
  • data at t2
    • rails,[6.3],MIT

Then the exporter writes this to the GCP bucket:

  • at t1 v1/rubygem/t1/file.csv
    • contents of csv are
      • rails,6.1,MIT
      • rails,6.2,MIT
  • at t2 v1/rubygem/t2/file.csv
    • contents of csv are
      • rails,6.3,MIT

This format allows both the producer and consumers to be stateless (aside from storing the last synced timestamp).
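The expansion in the example above can be sketched as follows. This is only an illustration of the shape of the data: `expand_rows` is a hypothetical helper and is not part of the exporter.

```ruby
# Hypothetical helper (not the exporter's code): turns one upstream record,
# such as rails,[6.1,6.2],MIT, into one CSV row per version.
def expand_rows(name, versions, license)
  versions.map { |version| [name, version, license].join(",") }
end

rows_t1 = expand_rows("rails", ["6.1", "6.2"], "MIT")
rows_t2 = expand_rows("rails", ["6.3"], "MIT")
```

Each delta only contains the rows that arrived at its timestamp, which is why a consumer needs no state beyond the last synced timestamp.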

Monolith Sync

The monolith uses checkpoints to store the last synced position. If a checkpoint exists (sequence and chunk match), only the files after this checkpoint are fetched.
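A minimal sketch of the checkpoint filter described above. `Checkpoint`, `FileRef` and `files_after` are hypothetical names used for illustration, and falling back to a full sync when no file matches the checkpoint is an assumption, not confirmed behaviour of the sync service.

```ruby
# Hypothetical value objects; the real models may carry more fields.
Checkpoint = Struct.new(:sequence, :chunk)
FileRef = Struct.new(:sequence, :chunk)

# Returns the files that still need to be fetched, in sync order.
def files_after(files, checkpoint)
  sorted = files.sort_by { |f| [f.sequence, f.chunk] }
  return sorted if checkpoint.nil?

  index = sorted.index do |f|
    f.sequence == checkpoint.sequence && f.chunk == checkpoint.chunk
  end

  # Assumption: an unknown checkpoint means everything is fetched again.
  return sorted if index.nil?

  sorted.drop(index + 1)
end
```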

The connectors instantiate a CsvFile, a simple enumerable container that exposes a lazy enumerator interface and parses the CSV data into a DataObject.
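As a rough illustration of that container, the sketch below wraps an IO and yields one parsed row at a time. `CsvFileSketch` and `LicenseRow` are hypothetical stand-ins for CsvFile and DataObject, and the naive comma split assumes fields contain no quoted commas.

```ruby
require "stringio"

# Hypothetical stand-in for the DataObject built from each row.
LicenseRow = Struct.new(:name, :version, :license)

# Hypothetical stand-in for CsvFile: an Enumerable over the rows of one chunk.
class CsvFileSketch
  include Enumerable

  def initialize(io)
    @io = io
  end

  # Reads line by line so a large chunk is never loaded into memory at once.
  def each
    return enum_for(:each) unless block_given?

    @io.each_line do |line|
      line = line.chomp
      next if line.empty?

      # Assumption: no quoted fields, so a plain split is enough here.
      yield LicenseRow.new(*line.split(",", 3))
    end
  end
end

io = StringIO.new("rails,6.1,MIT\nrails,6.2,MIT\n")
first_row = CsvFileSketch.new(io).lazy.first(1)
```

Calling `lazy` before `first(1)` stops reading after the first row instead of parsing the whole file.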

After ingestion is fully finished, the new checkpoint is saved: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L50

Changes

The identifier for this new format is v2, and it is part of the path that locates file chunks. The following change:

  1. URL
  2. Storage format
  3. Object format

1. URL changes

A data_type segment is added to the URL path, and the file extension changes. The path goes from v1/<purl_type>/<timestamp>/<chunk>.csv to v2/<purl_type>/[advisories|licenses]/<timestamp>/<chunk>.ndjson.
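The two path layouts can be sketched as below. `v1_path`, `v2_path` and `DATA_TYPES` are hypothetical names for illustration, and rejecting an unknown data_type is an assumption rather than documented connector behaviour.

```ruby
# The two data types named in the v2 path template above.
DATA_TYPES = %w[advisories licenses].freeze

# v1 layout: v1/<purl_type>/<timestamp>/<chunk>.csv
def v1_path(purl_type, timestamp, chunk)
  "v1/#{purl_type}/#{timestamp}/#{chunk}.csv"
end

# v2 layout: v2/<purl_type>/<data_type>/<timestamp>/<chunk>.ndjson
def v2_path(purl_type, data_type, timestamp, chunk)
  unless DATA_TYPES.include?(data_type)
    # Assumption: anything other than the two known types is an error.
    raise ArgumentError, "unknown data_type: #{data_type}"
  end

  "v2/#{purl_type}/#{data_type}/#{timestamp}/#{chunk}.ndjson"
end
```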

2. Storage format

The storage format changes from CSV to NDJSON (newline-delimited JSON), with one JSON object per line.

3. Object format

Each object is a JSON document with the following fields:

  • id - unique identifier for the advisory
  • database - indicating which database this advisory came from
  • advisory - stores contents of the advisory data
  • packages - stores the packages affected by this advisory and ranges affected

The fields for advisory and packages are specified in PackageMetadata::Advisory and PackageMetadata::AffectedPackage.

Example:

{
  "advisory": {
      "id": "CVE-2022-40303",
      "database": "trivy-db",
      "title": "",
      "description": "...",
      "cvss_v3": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H",
      ...
  },
  "packages": [
     {
       "name": "libxml2",
       "purl_type": "deb",
       "dist_version": "10",
       "affected_range": "<2.9.4+dfsg1-7+deb10u5",
       "severity": "..."
     },
     {
       // ...
     }
  ]
}
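A minimal sketch of reading such objects from an NDJSON chunk, one object per line. `NdjsonSketch` is a hypothetical module, not the monolith's parser, and the keys it reads follow the nesting shown in the example above.

```ruby
require "json"
require "stringio"

# Hypothetical reader: yields one parsed Hash per non-empty line.
module NdjsonSketch
  def self.each_object(io)
    return enum_for(:each_object, io) unless block_given?

    io.each_line do |line|
      line = line.strip
      next if line.empty?

      yield JSON.parse(line)
    end
  end
end

chunk = StringIO.new(
  %({"advisory":{"id":"CVE-2022-40303","database":"trivy-db"},) +
  %("packages":[{"name":"libxml2","purl_type":"deb"}]}\n)
)

objects = NdjsonSketch.each_object(chunk).to_a
advisory_id = objects.first["advisory"]["id"]
package_names = objects.first["packages"].map { |pkg| pkg["name"] }
```

Because each line is a complete JSON document, a chunk can be processed line by line without loading the whole file.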

Relevant links

Non-functional requirements

  • Documentation: n/a
  • Feature flag: n/a
  • Performance: n/a
  • Testing: n/a

Implementation plan

  • add sync config for advisories
    • add advisories specific data (bucket, offline location, etc.)
  • add data objects
  • update data object fabrication

Below is the old implementation plan. It was superseded by the plan above after most of the needed functionality was added in Refactor interface between sync protocol and da... (!120795 - merged).

Old implementation plan

Update checkpoint

  • create migration to add version_format and data_type to checkpoints

Update connectors (work ongoing in Refactor interface between sync protocol and da... (!120795 - merged))

  • extract common CsvFile functionality out of the offline and gcp connectors and rename this class to DataFile
  • update both connectors to accept data_type and select the correct url/path based on it
  • update connector iterators to instantiate a DataFile with data_type (e.g. gcp)
  • update DataFile to accept a data_type parameter and use it to determine the file suffix (e.g. for gcp)

Update data parsing

Edited by Igor Frenkel