Support advisories and affected packages data sync protocol
Why are we doing this work
A new version format is needed for advisory ingestion. The monolith sync service needs to be able to use this format.
Background
The external license database exports a set of deltas representing its internal dataset over time. A delta is written to a GCP bucket as a set of files at a particular timestamp. The timestamp is the identifier for that delta dataset. The data for a particular dataset is written as a set of chunks, each with an upper size limit.
As an example:
If data coming into the external license db looks like the following:
- data at t1
  - rails,[6.1,6.2],MIT
- data at t2
  - rails,[6.3],MIT
Then the exporter writes this to the GCP bucket:
- at t1
  - v1/rubygem/t1/file.csv
    - contents of csv are
      - rails,6.1,MIT
      - rails,6.2,MIT
- at t2
  - v1/rubygem/t2/file.csv
    - contents of csv are
      - rails,6.3,MIT
This format allows both the producer and consumers to be stateless (aside from storing the last synced timestamp).
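As a rough illustration of the example above, the expansion of one delta into CSV rows under a timestamped path could look like the following. This is only a sketch: `export_delta` is a hypothetical name, the real exporter lives outside the monolith, and chunking by size is left out.

```ruby
# Hypothetical sketch of the exporter's expansion step.
# Each record is [name, versions, license]; one CSV row is emitted per version.
def export_delta(purl_type, timestamp, records)
  rows = records.flat_map do |name, versions, license|
    versions.map { |version| [name, version, license].join(",") }
  end

  # The timestamp in the path identifies the delta dataset.
  { "v1/#{purl_type}/#{timestamp}/file.csv" => rows }
end

t1 = export_delta("rubygem", "t1", [["rails", ["6.1", "6.2"], "MIT"]])
t2 = export_delta("rubygem", "t2", [["rails", ["6.3"], "MIT"]])
```

Because each delta is self-contained under its own timestamp, a consumer only needs to remember the last timestamp it processed.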
Monolith Sync
The monolith uses checkpoints to store the last synced position. If a checkpoint exists (sequence and chunk match), only the files after this checkpoint are fetched.
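A minimal sketch of the checkpoint filter, assuming files are ordered by (sequence, chunk). `Checkpoint`, `RemoteFile`, and `files_after` are hypothetical names for illustration and not the monolith's actual classes; the real service may locate the matching file and skip past it instead of comparing positions.

```ruby
# Hypothetical value objects; the real monolith classes may differ.
Checkpoint = Struct.new(:sequence, :chunk)
RemoteFile = Struct.new(:sequence, :chunk)

# Returns the files positioned strictly after the checkpoint.
# Without a checkpoint, every file is returned.
def files_after(files, checkpoint)
  sorted = files.sort_by { |f| [f.sequence, f.chunk] }
  return sorted if checkpoint.nil?

  position = [checkpoint.sequence, checkpoint.chunk]
  sorted.select { |f| ([f.sequence, f.chunk] <=> position) == 1 }
end

files = [
  RemoteFile.new(1, "0.csv"),
  RemoteFile.new(1, "1.csv"),
  RemoteFile.new(2, "0.csv")
]
pending = files_after(files, Checkpoint.new(1, "0.csv"))
```

Here `pending` holds the second chunk of sequence 1 and the only chunk of sequence 2.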
The connectors instantiate a CsvFile which is a simple enumerable container responsible for offering a lazy enum interface and parsing the csv data into a DataObject.
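The container idea can be sketched as follows. `CsvFileSketch` and `LicenseRow` are illustrative stand-ins, not the actual CsvFile and DataObject classes, and the three-column row shape is assumed from the example data above.

```ruby
require "csv"
require "stringio"

# Hypothetical stand-in for the data object built from one CSV row.
LicenseRow = Struct.new(:name, :version, :license)

# Illustrative container: wraps an IO and yields one parsed object per line.
class CsvFileSketch
  include Enumerable

  def initialize(io)
    @io = io
  end

  def each
    return enum_for(:each) unless block_given?

    @io.each_line do |line|
      fields = CSV.parse_line(line.chomp)
      # Skip blank or malformed rows instead of raising.
      next if fields.nil? || fields.size != 3

      yield LicenseRow.new(*fields)
    end
  end
end

io = StringIO.new("rails,6.1,MIT\nrails,6.2,MIT\n")
first = CsvFileSketch.new(io).lazy.map(&:version).first(1)
```

With `lazy`, only as many lines are read and parsed as the caller consumes.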
After ingestion is fully finished, the new checkpoint is saved: https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/sync_service.rb#L50
Changes
The identifier for this new format is v2, and it is part of the path that locates file chunks. The following are changed:
- URL
- Storage format
- Object format
1. URL changes
data_type is added to the URL, which changes from v1/<purl_type>/<timestamp>/<chunk>.csv to v2/<purl_type>/[advisories|licenses]/<timestamp>/<chunk>.ndjson.
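The two path layouts can be compared with a small helper. `chunk_path` and `DATA_TYPES` are hypothetical names used only for this sketch; the connectors may build their paths differently.

```ruby
# Hypothetical helper: builds the chunk path for a given format version.
DATA_TYPES = %w[advisories licenses].freeze

def chunk_path(version:, purl_type:, timestamp:, chunk:, data_type: nil)
  case version
  when "v1"
    "v1/#{purl_type}/#{timestamp}/#{chunk}.csv"
  when "v2"
    # v2 adds the data_type segment and switches the suffix to .ndjson.
    raise ArgumentError, "unknown data_type: #{data_type.inspect}" unless DATA_TYPES.include?(data_type)

    "v2/#{purl_type}/#{data_type}/#{timestamp}/#{chunk}.ndjson"
  else
    raise ArgumentError, "unknown version: #{version.inspect}"
  end
end

v1_path = chunk_path(version: "v1", purl_type: "rubygem", timestamp: "t1", chunk: "file")
v2_path = chunk_path(version: "v2", purl_type: "deb", data_type: "advisories", timestamp: "t1", chunk: "file")
```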
2. Storage format
The storage format has been changed from csv to ndjson.
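In ndjson, each line is a complete JSON document, so a chunk can be consumed line by line. The sketch below assumes that shape; `each_ndjson_object` is a hypothetical helper and the `A-1`/`A-2` ids are made-up sample data.

```ruby
require "json"
require "stringio"

# Yields one Hash per non-empty line of the chunk.
def each_ndjson_object(io)
  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

chunk = StringIO.new(<<~NDJSON)
  {"advisory":{"id":"A-1"},"packages":[]}

  {"advisory":{"id":"A-2"},"packages":[]}
NDJSON

ids = []
each_ndjson_object(chunk) { |object| ids << object.dig("advisory", "id") }
```

Unlike a single large JSON array, this keeps memory use bounded by the size of one object.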
3. Object format
The object is a JSON object with the following fields:
- id: unique identifier for the advisory
- database: indicates which database this advisory came from
- advisory: stores the contents of the advisory data
- packages: stores the packages affected by this advisory and the affected ranges
The fields for advisory and packages are specified in PackageMetadata::Advisory and PackageMetadata::AffectedPackage.
Example:
{
  "advisory": {
    "id": "CVE-2022-40303",
    "database": "trivy-db",
    "title": "",
    "description": "...",
    "cvss_v3": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H",
    ...
  },
  "packages": [
    {
      "name": "libxml2",
      "purl_type": "deb",
      "dist_version": "10",
      "affected_range": "<2.9.4+dfsg1-7+deb10u5",
      "severity": "..."
    },
    {
      // ...
    }
  ]
}
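One way to picture how a parsed object maps onto the two record types is sketched below. `AdvisorySketch`, `AffectedPackageSketch`, and `build_records` are hypothetical; they only mirror the fields visible in the example and do not reflect the actual schema of PackageMetadata::Advisory or PackageMetadata::AffectedPackage.

```ruby
# Hypothetical plain structs standing in for the real models.
AdvisorySketch = Struct.new(:id, :database, :title, :description, :cvss_v3, keyword_init: true)
AffectedPackageSketch = Struct.new(:name, :purl_type, :dist_version, :affected_range, :severity, keyword_init: true)

# Keeps only the keys the struct knows about and converts them to symbols.
def known_attrs(hash, struct_class)
  hash.slice(*struct_class.members.map(&:to_s)).transform_keys(&:to_sym)
end

# Builds one advisory and its affected packages from an already parsed object.
def build_records(object)
  advisory = AdvisorySketch.new(**known_attrs(object.fetch("advisory"), AdvisorySketch))
  packages = object.fetch("packages", []).map do |pkg|
    AffectedPackageSketch.new(**known_attrs(pkg, AffectedPackageSketch))
  end

  [advisory, packages]
end

object = {
  "advisory" => { "id" => "CVE-2022-40303", "database" => "trivy-db" },
  "packages" => [
    { "name" => "libxml2", "purl_type" => "deb", "dist_version" => "10",
      "affected_range" => "<2.9.4+dfsg1-7+deb10u5" }
  ]
}
advisory, packages = build_records(object)
```

Unknown keys are dropped and missing ones stay nil, which keeps the sketch tolerant of fields elided in the example.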
Relevant links
- version format discussion #370780 (closed)
- research spike #394723 (closed)
Non-functional requirements
- Documentation: n/a
- Feature flag: n/a
- Performance: n/a
- Testing: n/a
Implementation plan
- add sync config for advisories
  - add advisories specific data (bucket, offline location, etc.)
- add data objects
- update data object fabrication
Below is the old implementation plan. It was superseded by the plan above after most of the needed functionality was added in Refactor interface between sync protocol and da... (!120795 - merged).
Old implementation plan
Update checkpoint
- create migration to add version_format and data_type to checkpoints
Update connectors (work ongoing in Refactor interface between sync protocol and da... (!120795 - merged))
- extract common CsvFile functionality out of offline and gcp connectors and change this class to DataFile
- update both connectors to accept data_type and select the correct url/path based on it
- update connector iterators to instantiate a DataFile with data_type (e.g. gcp)
- update DataFile to accept a data_type parameter so as to determine file suffix (e.g. for gcp)
  - offline archive_path
  - gcp file_prefix
Update data parsing
- rename PackageMetadata::DataObject to PackageMetadata::LicenseDataObject
- add new object PackageMetadata::AdvisoryDataObject with fields to populate PackageMetadata::Advisory and PackageMetadata::AffectedPackage (similar to PackageMetadata::LicenseDataObject)
- rename .from_csv to .parse https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/data_object.rb#L12
- update .parse to support json as well as csv https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/package_metadata/data_object.rb#L12 based on data_type supplied by connector