Spike: PackageMetadata sync data format v3
Note
This is still a draft.
Problem to Solve
PackageMetadata sync is integral to SCA capabilities for license identification, vulnerability scanning, and vulnerability prioritization (e.g. CVSS vectors). Although it has worked fairly reliably over the last few major milestones, we have repeatedly run into limitations that are hard to fix with the current data format.
Criteria for PackageMetadata Synchronization
Some of the criteria here are not technically difficult on their own, but they face limitations stemming from the unique needs of GitLab instances. These include the Rails backend used by GitLab instances, the ACID data model, and the constraints of the different types of instance installs.
- Use a data-at-rest model: package_metadata lives in the GitLab PostgreSQL database, and ingestion uses an upsert mechanism; re-upserting the full dataset takes a long time.
- Initial DB import time increases as the dataset grows.
- Costs of List operations on the bucket (mitigated by compaction).
- JSON schema versioning.
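The upsert-based ingestion path above can be sketched as follows. This is a minimal illustration, not the real implementation: it uses an in-memory SQLite table as a stand-in for the GitLab PostgreSQL schema, and the table and column names are assumptions, not the actual package_metadata schema.

```python
import sqlite3

# In-memory stand-in for the instance database (the real data lives in
# PostgreSQL; table and column names here are illustrative assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE package_licenses (
        purl_type TEXT NOT NULL,
        name      TEXT NOT NULL,
        licenses  TEXT NOT NULL,
        PRIMARY KEY (purl_type, name)
    )
""")

def upsert(rows):
    # Re-running with the same rows updates in place instead of inserting
    # duplicates -- which is also why full re-upserts are slow: every row
    # is touched even when nothing has changed.
    conn.executemany(
        """
        INSERT INTO package_licenses (purl_type, name, licenses)
        VALUES (?, ?, ?)
        ON CONFLICT (purl_type, name) DO UPDATE SET licenses = excluded.licenses
        """,
        rows,
    )

upsert([("npm", "left-pad", "MIT")])
upsert([("npm", "left-pad", "MIT OR Apache-2.0")])  # same key, updated value
count, licenses = conn.execute(
    "SELECT COUNT(*), MAX(licenses) FROM package_licenses"
).fetchone()
print(count, licenses)  # one row, carrying the latest license value
```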
Data Format Limitations
v1
Used an additive "everything in the bucket" model. It streamed updates to the instance database directly and had no mechanism for compressing version ranges. As a result, it quickly grew to an unmanageable size.
v2
- Added advisories as a data source.
- Used range compression for licenses.
- Switched to ndjson as a format.
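The NDJSON format and license range compression can be illustrated together: each line of a slice is a standalone JSON object, and a license record covers a version range rather than one row per version. The field names below are assumptions for illustration, not the real v2 schema.

```python
import json

# Two illustrative v2-style NDJSON lines: one JSON object per line, with
# licenses compressed over a version range instead of one row per version.
# Field names are illustrative assumptions, not the real schema.
ndjson_slice = (
    '{"name": "rails", "lowest_version": "6.0.0",'
    ' "highest_version": "6.1.7", "licenses": ["MIT"]}\n'
    '{"name": "rails", "lowest_version": "7.0.0",'
    ' "highest_version": "7.1.2", "licenses": ["MIT"]}\n'
)

def parse_slice(text):
    # NDJSON parses line by line, so an instance can stream a slice
    # without loading the whole export into memory at once.
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = parse_slice(ndjson_slice)
print(len(records), records[0]["lowest_version"])
```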
Limitations of version_format v2
Initial sync
- The data provides deltas to instances that have already synced, but it is more of a data partition than a true delta mechanism.
Schema updates
- The cvss_v4 discussion has shown the problem with an approach that assumes a fixed schema structure.
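The assumed-schema problem can be sketched concretely: a strict reader built against a fixed field list breaks as soon as a new attribute such as cvss_v4 appears, while a tolerant reader keeps working. All field names and CVSS vector strings below are illustrative placeholders, not real data.

```python
import json

# One advisory line that predates cvss_v4, and one that adds it.
# Field names and vector strings are illustrative placeholders.
old_line = '{"id": "ADV-1", "cvss_v3": "CVSS:3.1/placeholder"}'
new_line = ('{"id": "ADV-2", "cvss_v3": "CVSS:3.1/placeholder",'
            ' "cvss_v4": "CVSS:4.0/placeholder"}')

KNOWN_FIELDS = {"id", "cvss_v3"}

def ingest_strict(line):
    # Rejects new attributes: this models the "assumed schema" problem.
    record = json.loads(line)
    unknown = set(record) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    return record

def ingest_tolerant(line):
    # Keep what we know and ignore the rest, so older instances keep
    # working when a new attribute like cvss_v4 appears in the data.
    record = json.loads(line)
    return {k: v for k, v in record.items() if k in KNOWN_FIELDS}

print(ingest_tolerant(new_line)["id"])
```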
Considered Approaches
- JSON format in the bucket, similar to v2
- SQLite - TBC
- ORAS container - TBC
- WAL format - TBC
Evaluated: Approach #1 (closed) - Improved JSON format
Definition
Improving on the current approach is promising for three main reasons: it streams updates from primary sources, it is a structured-data approach, and we control the exporter. The improvements involve keeping the full dataset in the bucket alongside a limited number of deltas, using a flexible but "removals restricted" schema, and taking advantage of the asynchronous nature of PackageMetadata sync by compressing the data aggressively.
Detail
- Keep the delta method to help instances upsert, but limit it to the last few slices.
- Always allow an instance to fully re-sync by keeping the full compacted dataset alongside the deltas.
- Use a manifest to describe what is in the data, allowing instances to selectively pull files rather than listing bucket contents.
- The schema is "additional properties only". New attributes are immediately reflected in the delta.
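The manifest-driven selective pull described above can be sketched as follows. The manifest shape is entirely hypothetical (the v3 manifest format is not specified here); the idea is that an instance that is only slightly behind pulls just the missing deltas, while one that has fallen off the retained delta window falls back to the full export.

```python
import json

# Hypothetical manifest shape: every field below is an illustrative
# assumption, not the real v3 manifest format.
manifest = json.loads("""
{
  "version_format": "v3",
  "full_export": {"file": "full/00000042.ndjson.gz", "sequence": 42},
  "deltas": [
    {"file": "deltas/00000040.ndjson.gz", "sequence": 40},
    {"file": "deltas/00000041.ndjson.gz", "sequence": 41},
    {"file": "deltas/00000042.ndjson.gz", "sequence": 42}
  ]
}
""")

def files_to_pull(manifest, last_synced_sequence):
    # Selective pull: the manifest replaces a bucket List operation.
    deltas = [d for d in manifest["deltas"]
              if d["sequence"] > last_synced_sequence]
    oldest_retained = min(d["sequence"] for d in manifest["deltas"])
    if last_synced_sequence >= oldest_retained - 1:
        return [d["file"] for d in deltas]   # close enough: deltas only
    return [manifest["full_export"]["file"]]  # too far behind: full re-sync

print(files_to_pull(manifest, 41))  # ['deltas/00000042.ndjson.gz']
print(files_to_pull(manifest, 10))  # ['full/00000042.ndjson.gz']
```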
Conclusions
For a full conclusion on the v3 format please refer to the ADR document.
Solutions regarding storing the data (not strictly related to v3)
| Solution | Description | Pros | Cons |
|---|---|---|---|
| CloudFlare | manifest + deltas + full_export all in CF R2 bucket | | |
| GCP + External Location | | | |
| GCP with ETag Caching | | | Still pays egress on the first fetch and when manifest.json changes |
| GCP with Labels | bucket labels + manifest.json + deltas + full_export all in GCP | | |
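The ETag caching option can be illustrated without a real network. This is a sketch under the assumption that the instance stores the last seen ETag and issues a conditional GET with `If-None-Match`, re-downloading manifest.json only when it has changed; the function names and the fake server are invented for illustration.

```python
# Sketch of ETag-based conditional fetching. fetch_fn stands in for a
# real HTTP GET; all names and shapes here are illustrative assumptions.

def conditional_fetch(url, cached_etag, cached_body, fetch_fn):
    """Return (etag, body, downloaded) using an If-None-Match header."""
    headers = {}
    if cached_etag is not None:
        headers["If-None-Match"] = cached_etag
    status, etag, body = fetch_fn(url, headers)
    if status == 304:                # unchanged: no egress for the body
        return cached_etag, cached_body, False
    return etag, body, True          # changed (or first fetch): pay egress

def fake_server(url, headers):
    # Stand-in server: manifest.json currently has ETag "abc".
    if headers.get("If-None-Match") == '"abc"':
        return 304, None, None
    return 200, '"abc"', '{"deltas": []}'

etag, body, downloaded = conditional_fetch("manifest.json", None, None, fake_server)
print(downloaded)   # True: the first fetch always downloads
etag, body, downloaded = conditional_fetch("manifest.json", etag, body, fake_server)
print(downloaded)   # False: 304 Not Modified, body served from cache
```

This matches the "Cons" entry in the table: the first fetch and every change to manifest.json still incur egress; only the unchanged case is free.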