Spike: How do we sync the backend with a source of advisories and affected versions?
Time-boxed: 3 days
Topic to Evaluate
As part of Dependency Scanning: CVS Trigger scans on Advis... (&9534 - closed), we need to evaluate the feasibility of a sync protocol the backend would use to get security advisories from an external service. The sync protocol must support the following scenarios:
- Import advisories with sets of affected versions.
- Import changes in the description of an advisory.
- Import changes in the affected versions for existing security advisories.
- New versions are available, and they are affected.
- Versions that have already been exported become affected.
- Versions that have already been exported are no longer affected.
(When importing changes in the affected versions, the backend responds by adding or removing vulnerabilities in the projects referencing these versions.)
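The add/remove behavior above can be sketched as a set difference over an advisory's affected versions. This is a hypothetical helper for illustration, not the actual backend code:

```python
# Sketch (not the real implementation): given the previously synced set of
# affected versions for one advisory and the newly imported set, derive
# which versions need vulnerabilities added and which need them removed.

def reconcile_affected_versions(old_affected: set, new_affected: set):
    """Return (to_add, to_remove) version sets for a single advisory."""
    to_add = new_affected - old_affected      # versions that became affected
    to_remove = old_affected - new_affected   # versions no longer affected
    return to_add, to_remove

# Example: 6.0 was fixed, 6.2 is newly affected.
add, remove = reconcile_affected_versions({"6.0", "6.1"}, {"6.1", "6.2"})
assert add == {"6.2"}
assert remove == {"6.0"}
```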
Problem discussion
The need for defining a sync protocol comes from the fact that package registries constantly add data, and the producer (the external license database) needs to transfer that data to the GitLab monolith each time.
Transferring the full dataset each time would still be a problem even if transfer time and resource usage were not an issue. Under the current (v1) protocol scheme, an initial sync (full dataset) has been observed to take on the order of hours and has a large impact on database usage with the current ingestion (upsert) scheme.
The current v1 version of the package metadata sync (spike: sync between monolith package metadata ... (#379137 - closed)) represents changes to the dataset as deltas since the last time the producer exported the data. The consumer just needs to resume from its last sync checkpoint to get the changes.
The problem with v1 is that it can only capture data being added; if data is updated or removed, there is no way to represent this.
Proposal
A couple of options are viable as an MVC:
- extend the csv deltas format to incorporate data changes
- change the protocol to represent the complete dataset in a (file/tree) structured way and represent changes via a separate change manifest
Both options will have to add a new data type, and updating the path seems most convenient: the current license dataset would go under `<version>/<purl_type>/licenses/<sequence>/<chunk>` and the new dataset (affected versions) under `<version>/<purl_type>/advisories/<sequence>/<chunk>`.
Option 1: extend the delta csv format
For advisories, the 3rd column (currently storing licenses) can be repurposed to store CVEs:

```
rails,6.1,CVE-1
rails,6.1,CVE-2
rails,6.1,CVE-3
```
Data for a particular package-version combination would be combined under a single record and the information packed into the last column (currently storing a single license): `rails,6.1,"CVE-1,CVE-2,CVE-3"` or `rails,6.1,"MIT,Apache"`.
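For illustration, a standard csv parser already handles the quoted packed column; the following sketch (with hypothetical data) shows how a consumer could unpack such a row:

```python
import csv
import io

# A packed delta row: one record per package-version, with the quoted last
# column holding the full CVE (or license) set for that tuple.
row_data = 'rails,6.1,"CVE-1,CVE-2,CVE-3"\n'

# csv.reader honors the quoting, so the packed column arrives as one field.
name, version, packed = next(csv.reader(io.StringIO(row_data)))
cves = packed.split(",")

assert name == "rails"
assert version == "6.1"
assert cves == ["CVE-1", "CVE-2", "CVE-3"]
```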
- pros
  - producer doesn't have to keep state; it just dumps out all data for a changed record
- cons
  - very large rows are possible
  - lots of data has to be read to make one small change
Note: there is another option of using the csv format as a transaction log (e.g. indicating whether a given record was added, deleted, updated, etc.). But this imposes a high burden on both producers and consumers. The producer needs to track each change internally (e.g. this is a deletion because the record exists), and a consumer with an outdated dataset will have to consume lots of unnecessary data (e.g. for a single row, process an addition, then a deletion, then another update). This can be reconsidered if updates are few.
Note 2: another option is to use a more structured format like json in order to avoid packing semantic data into csv rows (e.g. compressing several licenses into a single field and having to escape the csv delimiter).
Option 2: switch to a format representing the full dataset
This means representing the data in a file-like structure. For example:
```
.
- v2
  - gem
    - rails
      - 6.1
        - licenses
        - advisories
      - 6.2
        - licenses
        - advisories
    - rspec-core
      - 3.10.0
        - licenses
        - advisories
...
```
The exact structure (e.g. splitting licenses and advisories by version vs by package) is still to be explored and can be determined based on the cardinality of the datasets.
When a change is made, the producer changes only the affected path and writes that path to a changeset. This scheme simplifies several things: the producer always stores the latest representation of the data in the bucket, but the changes it has to make are quite small and limited to only what has changed.
```
.
- v2
  - change_manifest
    - c1
      - *
    - c2.json
      - contents: { paths: ['v2/dataset/gem/rails/6.2'] }
  - dataset
    - change_id=c2
    - gem
      - rails
        - versions
          - 6.1
            - advisories
            - licenses
          - 6.2
            - advisories
            - licenses
    - golang
```
As an example, if rails added a new version (6.2) and its licenses, the producer does the following:
- add prefix 6.2 and its data under `v2/dataset/gem/rails`
- generate a new change_id (`c2`)
- update `v2/dataset/gem/change_id` to `c2`
- add `c2` to `v2/change_manifest` with the path that was changed
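The producer steps above can be sketched as follows, with a plain dict standing in for the GCP bucket (keys are object paths; all names and paths are illustrative, not real code):

```python
import json

# Hypothetical producer-side sketch of the v2 change-manifest scheme.
# The dict simulates the bucket; real code would write bucket objects.
bucket = {"v2/dataset/gem/change_id": "c1"}

def producer_publish(bucket, change_id, changed_path, data):
    """Publish one change: write the data, bump change_id, record the manifest."""
    bucket[changed_path] = data                         # write the changed prefix
    bucket["v2/dataset/gem/change_id"] = change_id      # update the current change_id
    bucket["v2/change_manifest/%s.json" % change_id] = json.dumps(
        {"paths": [changed_path]}                       # record what changed
    )

# rails adds version 6.2 with its licenses:
producer_publish(bucket, "c2", "v2/dataset/gem/rails/6.2", '{"licenses": ["mit"]}')

assert bucket["v2/dataset/gem/change_id"] == "c2"
```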
When the consumer needs to sync its dataset, it:
- fetches the last checkpoint or `change_id` it stored (`c1`)
- fetches the current `change_id` in `v2/dataset/gem/change_id`
- finds the changes in `v2/change_manifest`
- iterates over the changed paths, syncing data just for what was changed
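The consumer side can be sketched similarly. Note that `manifest_order` is an assumption here (an ordered listing of change ids in the manifest); the real protocol would need some way to order change ids:

```python
import json

# Hypothetical consumer-side sketch. A dict stands in for the bucket and
# `manifest_order` stands in for an ordered listing of v2/change_manifest.

def consumer_sync(bucket, local_state, last_change_id, manifest_order):
    """Resume from the stored checkpoint and apply every newer changeset."""
    current = bucket["v2/dataset/gem/change_id"]
    if current == last_change_id:
        return current                        # already up to date
    start = manifest_order.index(last_change_id) + 1
    for change_id in manifest_order[start:]:  # every change since the checkpoint
        manifest = json.loads(bucket["v2/change_manifest/%s.json" % change_id])
        for path in manifest["paths"]:        # sync only the changed paths
            local_state[path] = bucket[path]
    return current                            # new checkpoint to store

bucket = {
    "v2/dataset/gem/change_id": "c2",
    "v2/change_manifest/c2.json": json.dumps({"paths": ["v2/dataset/gem/rails/6.2"]}),
    "v2/dataset/gem/rails/6.2": '{"licenses": ["mit"]}',
}
local = {}
checkpoint = consumer_sync(bucket, local, "c1", ["c1", "c2"])
assert checkpoint == "c2"
assert local["v2/dataset/gem/rails/6.2"] == '{"licenses": ["mit"]}'
```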
It is likely overkill to have a file per version; it is much likelier that we would have a json file per package. Large packages can be split up into files using some part of the version (e.g. the major).
pros
- producer does not need to be aware of state outside of the last change written
- consumer only needs to store the change_id
- granularity of the change is controllable (e.g. `v2/dataset/gem/rails` can be used to indicate that the whole package changed rather than just a version)
- for an initial sync or a missing changeset, the consumer reads all of `v2/dataset` and always arrives at the same state as a consumer using a delta between two change_ids (this is similar to the current scheme)
cons
- more complicated protocol
- the consumer needs to figure out how to reconcile the current dataset state against what it has stored
Note: git would be a decent candidate here (as this scheme is just simplified version control) but there is no easy way to store the repo in a GCP bucket.
Conclusion: reuse the delta format with more structured data
Option 1 with the ndjson variation is chosen. It is a smaller iteration on the original format and will allow us to store advisories as well as the full license dataset. See more benefits in this thread (referred to as option 3).
The changed format will be called v2 and will continue to represent the delta of changes in the data corpus since the last update.
Protocol format:
- version string for this protocol is `v2`.
- file format used is ndjson.
- url layout for delta files is: `v2/<purl_type>/[advisories|licenses]/<timestamp>/<chunk>.ndjson`
- line format contains a unique package `name` and `version` as well as the complete set of data belonging to this tuple
  - as an example, if a package-version (e.g. `rails-6.1.1`) gets a new license, the whole license set will be represented in the data
    - first entry: `{ name: "rails", version: "6.1.1", licenses: ['mit'] }`
    - subsequent entry after license change (e.g. `apache` added): `{ name: "rails", version: "6.1.1", licenses: ['mit', 'apache'] }`
- each line in `advisories/<timestamp>/<chunk>.ndjson` will represent all data for a single advisory for a package, e.g. `{ name: "rails", affected_range: ">=1.1.0 <1.1.6", identifier: "CVE-2006-4111", etc. }`
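A minimal sketch of consuming one such ndjson chunk, assuming (per the line format above) that each line fully replaces the record for its (name, version) tuple, which makes replays idempotent:

```python
import json

# Two delta lines for the same tuple, as in the example above: the second
# line carries the complete license set after 'apache' was added.
chunk = "\n".join([
    json.dumps({"name": "rails", "version": "6.1.1", "licenses": ["mit"]}),
    json.dumps({"name": "rails", "version": "6.1.1", "licenses": ["mit", "apache"]}),
])

dataset = {}
for line in chunk.splitlines():
    record = json.loads(line)
    # Upsert keyed on the (name, version) tuple; last write wins.
    dataset[(record["name"], record["version"])] = record

assert dataset[("rails", "6.1.1")]["licenses"] == ["mit", "apache"]
```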
Note regarding `<purl_type>` in the url: in v1 this url fragment was not a true purl_type but rather the internal name of the package registry. v2 will emit the correct `purl_type`: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/package_metadata/sync_configuration.rb#L8
At this time, affected versions are excluded from the implementation.