Spike: How do we sync the backend with a source of advisories and affected versions?
Time-boxed: 3 days
Topic to Evaluate
As part of Dependency Scanning: CVS Trigger scans on Advis... (&9534 - closed), we need to evaluate the feasibility of a sync protocol the backend would use to get security advisories from an external service. The sync protocol must support the following scenarios:
- Import advisories with sets of affected versions.
- Import changes in the description of an advisory.
- Import changes in the affected versions for existing security advisories.
- New versions are available, and they are affected.
- Versions that have already been exported become affected.
- Versions that have already been exported are no longer affected.
(When importing changes in the affected versions, the backend responds by adding or removing vulnerabilities in the projects referencing these versions.)
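The add/remove behavior above can be sketched as a set difference over an advisory's affected versions. This is a hypothetical helper for illustration, not the actual backend code:

```python
# Sketch (not the real implementation): given the previously synced set of
# affected versions for one advisory and the newly imported set, derive
# which versions need vulnerabilities added and which need them removed.

def reconcile_affected_versions(old_affected: set, new_affected: set):
    """Return (to_add, to_remove) version sets for a single advisory."""
    to_add = new_affected - old_affected      # versions that became affected
    to_remove = old_affected - new_affected   # versions no longer affected
    return to_add, to_remove

# Example: 6.0 was fixed, 6.2 is newly affected.
add, remove = reconcile_affected_versions({"6.0", "6.1"}, {"6.1", "6.2"})
assert add == {"6.2"}
assert remove == {"6.0"}
```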
Problem discussion
The need for defining a sync protocol comes from the fact that package registries constantly add data, and the producer (the external license database) needs to transfer that data to the GitLab monolith each time.
Transferring the full dataset each time would still be a problem even if transfer time and resource usage were not an issue. Under the current (v1) protocol scheme, an initial sync (full dataset) has been observed to take on the order of hours and has a large impact on database usage with the current ingestion (upsert) scheme.
The current v1 version of the package metadata sync (spike: sync between monolith package metadata ... (#379137 - closed)) represents changes to the dataset as deltas since the last time the producer exported the data. The consumer just needs to resume from its last sync checkpoint to get the changes.
The problem with v1 is that it can only capture data being added; if data is updated or removed, there is no way to represent this.
Proposal
A couple of options are viable as an MVC:
- extend the csv deltas format to incorporate data changes
- change the protocol to represent the complete dataset in a (file/tree) structured way and represent changes via a separate change manifest
Both options will have to add a new data type, and updating the path seems most convenient: the current license dataset would go under `<version>/<purl_type>/licenses/<sequence>/<chunk>` and the new dataset (affected versions) under `<version>/<purl_type>/advisories/<sequence>/<chunk>`.
Option 1: extend the delta csv format
For advisories, the 3rd column (currently storing licenses) can be repurposed to store CVEs:

```
rails,6.1,CVE-1
rails,6.1,CVE-2
rails,6.1,CVE-3
```
Data for a particular package-version combination would be combined under a single record and the information packed into the last column (currently storing a single license): `rails,6.1,"CVE-1,CVE-2,CVE-3"` or `rails,6.1,"MIT,Apache"`.
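For illustration, a standard csv parser already handles the quoted packed column; the following sketch (with hypothetical data) shows how a consumer could unpack such a row:

```python
import csv
import io

# A packed delta row: one record per package-version, with the quoted last
# column holding the full CVE (or license) set for that tuple.
row_data = 'rails,6.1,"CVE-1,CVE-2,CVE-3"\n'

# csv.reader honors the quoting, so the packed column arrives as one field.
name, version, packed = next(csv.reader(io.StringIO(row_data)))
cves = packed.split(",")

assert name == "rails"
assert version == "6.1"
assert cves == ["CVE-1", "CVE-2", "CVE-3"]
```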
- pros
  - producer doesn't have to keep state; it just dumps out all data for a changed record
- cons
  - very large rows are possible
  - lots of data has to be read to make one small change
Note: there is another option of using the csv format as a transaction log (e.g. indicating whether a given record was added, deleted, updated, etc.). But this imposes a high burden on both producers and consumers. The producer needs to track each change internally (e.g. this is a deletion because the record exists), and a consumer with an outdated dataset will have to consume lots of unnecessary data (e.g. for a single row, process an addition, then a deletion, then another update). This can be reconsidered if updates are few.
Note 2: another option is to use a more structured format like json in order to avoid packing semantic data into csv rows (e.g. compressing several licenses into a single field and having to escape the csv delimiter).
Option 2: switch to a format representing the full dataset
This means representing the data in a file-like structure. For example:
```
.
- v2
  - gem
    - rails
      - 6.1
        - licenses
        - advisories
      - 6.2
        - licenses
        - advisories
    - rspec-core
      - 3.10.0
        - licenses
        - advisories
...
```
The exact structure (e.g. splitting licenses and advisories by version vs by package) is still to be explored and can be determined based on the cardinality of the datasets.
When a change is made, the producer changes only the affected path and writes that path to a changeset. This scheme simplifies several things: the producer always stores the latest representation of the data in the bucket, but the changes it has to make are quite small and limited to only what has changed.
```
.
- v2
  - change_manifest
    - c1
      - *
    - c2.json
      - contents: { paths: ['v2/dataset/gem/rails/6.2'] }
  - dataset
    - change_id=c2
    - gem
      - rails
        - versions
          - 6.1
            - advisories
            - licenses
          - 6.2
            - advisories
            - licenses
    - golang
```
As an example, if rails added a new version (6.2) and its licenses, the producer does the following:
- add prefix 6.2 and its data under `v2/dataset/gem/rails`
- generate a new change_id (`c2`)
- update `v2/dataset/gem/change_id` to `c2`
- add `c2` to `v2/change_manifest` with the path that was changed
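The producer steps above can be sketched as follows, with a plain dict standing in for the GCP bucket (keys are object paths; all names and paths are illustrative, not real code):

```python
import json

# Hypothetical producer-side sketch of the v2 change-manifest scheme.
# The dict simulates the bucket; real code would write bucket objects.
bucket = {"v2/dataset/gem/change_id": "c1"}

def producer_publish(bucket, change_id, changed_path, data):
    """Publish one change: write the data, bump change_id, record the manifest."""
    bucket[changed_path] = data                         # write the changed prefix
    bucket["v2/dataset/gem/change_id"] = change_id      # update the current change_id
    bucket["v2/change_manifest/%s.json" % change_id] = json.dumps(
        {"paths": [changed_path]}                       # record what changed
    )

# rails adds version 6.2 with its licenses:
producer_publish(bucket, "c2", "v2/dataset/gem/rails/6.2", '{"licenses": ["mit"]}')

assert bucket["v2/dataset/gem/change_id"] == "c2"
```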
When the consumer needs to sync its dataset, it:
- fetches the last checkpoint or `change_id` it stored (`c1`)
- fetches the current `change_id` in `v2/dataset/gem/change_id`
- finds the changes in `v2/change_manifest`
- iterates over the changed paths, syncing data just for what was changed
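The consumer side can be sketched similarly. Note that `manifest_order` is an assumption here (an ordered listing of change ids in the manifest); the real protocol would need some way to order change ids:

```python
import json

# Hypothetical consumer-side sketch. A dict stands in for the bucket and
# `manifest_order` stands in for an ordered listing of v2/change_manifest.

def consumer_sync(bucket, local_state, last_change_id, manifest_order):
    """Resume from the stored checkpoint and apply every newer changeset."""
    current = bucket["v2/dataset/gem/change_id"]
    if current == last_change_id:
        return current                        # already up to date
    start = manifest_order.index(last_change_id) + 1
    for change_id in manifest_order[start:]:  # every change since the checkpoint
        manifest = json.loads(bucket["v2/change_manifest/%s.json" % change_id])
        for path in manifest["paths"]:        # sync only the changed paths
            local_state[path] = bucket[path]
    return current                            # new checkpoint to store

bucket = {
    "v2/dataset/gem/change_id": "c2",
    "v2/change_manifest/c2.json": json.dumps({"paths": ["v2/dataset/gem/rails/6.2"]}),
    "v2/dataset/gem/rails/6.2": '{"licenses": ["mit"]}',
}
local = {}
checkpoint = consumer_sync(bucket, local, "c1", ["c1", "c2"])
assert checkpoint == "c2"
assert local["v2/dataset/gem/rails/6.2"] == '{"licenses": ["mit"]}'
```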
It is likely overkill to have a file per version; it is much likelier that we would have a json file per package. Large packages can be split up into files using some part of the version (e.g. the major).
pros
- producer does not need to be aware of state outside of the last change written
- consumer only needs to store the change_id
- granularity of the change is controllable (e.g. `v2/dataset/gem/rails` can be used to indicate that the whole package changed rather than just a version)
- for an initial sync or a missing changeset, the consumer reads all of `v2/dataset` and always arrives at the same state as a consumer using a delta between two change_ids (this is similar to the current scheme)
cons
- more complicated protocol
- the consumer needs to figure out how to reconcile the current dataset state against what it has stored
Note: git would be a decent candidate here (as this scheme is just simplified version control) but there is no easy way to store the repo in a GCP bucket.
Conclusion: reuse the delta format with more structured data
Option 1 with the ndjson variation is chosen. It is a smaller iteration on the original format and will allow us to store advisories as well as the full license dataset. See more benefits in this thread (referred to as option 3).
The changed format will be called v2 and will continue to represent the delta of changes in the data corpus since the last update.
Protocol format:
- version string for this protocol is `v2`.
- file format used is ndjson.
- url layout for delta files is: `v2/<purl_type>/[advisories|licenses]/<timestamp>/<chunk>.ndjson`
- line format contains a unique package `name` and `version` as well as the complete set of data belonging to this tuple
  - as an example, if a package-version (e.g. `rails-6.1.1`) gets a new license, the whole license set will be represented in the data
    - first entry: `{ name: "rails", version: "6.1.1", licenses: ['mit'] }`
    - subsequent entry after license change (e.g. `apache` added): `{ name: "rails", version: "6.1.1", licenses: ['mit', 'apache'] }`
- each line in `advisories/<timestamp>/<chunk>.ndjson` will represent all data for a single advisory for a package, e.g. `{ name: "rails", affected_range: ">=1.1.0 <1.1.6", identifier: "CVE-2006-4111", etc. }`
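A minimal sketch of consuming one such ndjson chunk, assuming (per the line format above) that each line fully replaces the record for its (name, version) tuple, which makes replays idempotent:

```python
import json

# Two delta lines for the same tuple, as in the example above: the second
# line carries the complete license set after 'apache' was added.
chunk = "\n".join([
    json.dumps({"name": "rails", "version": "6.1.1", "licenses": ["mit"]}),
    json.dumps({"name": "rails", "version": "6.1.1", "licenses": ["mit", "apache"]}),
])

dataset = {}
for line in chunk.splitlines():
    record = json.loads(line)
    # Upsert keyed on the (name, version) tuple; last write wins.
    dataset[(record["name"], record["version"])] = record

assert dataset[("rails", "6.1.1")]["licenses"] == ["mit", "apache"]
```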
Note regarding `<purl_type>` in the url: in v1 this url fragment was not a true purl_type but rather the internal name of the package registry. v2 will emit the correct `purl_type`: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/package_metadata/sync_configuration.rb#L8
At this time, affected versions are excluded from the implementation.