Spike: PackageMetadata sync data format v3

Note

This is still in draft

Problem to Solve

PackageMetadata sync is integral to SCA capabilities for license identification, vulnerability scanning, and vulnerability prioritization (e.g. CVSS vectors). Even though it has worked fairly reliably over the last few major milestones, we have repeatedly hit limitations that are hard to fix within the current data format.

Criteria for PackageMetadata Synchronization

Some of the criteria here are not technically difficult on their own, but they face limitations driven by the unique needs of GitLab instances: the Rails backend, the ACID data model, and the variety of instance installation types.

  • Use a data-at-rest model: package_metadata lives in the GitLab PostgreSQL database, ingestion uses an upsert mechanism, and re-upserting takes a long time
  • Initial DB import time increases
  • Costs of bucket List operations - compaction
  • JSON schema versioning

Data Format Limitations

v1

Used an additive "everything in the bucket" model. Streamed updates directly to the instance database and contained no mechanism for compressing version ranges. It quickly grew to an unmanageable size.

v2

  • Added advisories as a data source.
  • Used range compression for licenses.
  • Switched to ndjson as a format.
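As a sketch of what a v2-style ndjson payload might look like, the snippet below parses one JSON object per line, with licenses attached to a compressed version range rather than one row per version. The field names and values are illustrative assumptions, not the real schema:

```python
import io
import json

# Hypothetical ndjson body: one record per line, licenses mapped to a
# version range ["lowest", "highest"] instead of one entry per version.
sample = io.StringIO(
    '{"purl_type": "npm", "name": "left-pad", "versions": ["1.0.0", "1.3.0"], "licenses": ["MIT"]}\n'
    '{"purl_type": "npm", "name": "lodash", "versions": ["4.17.0", "4.17.21"], "licenses": ["MIT"]}\n'
)

def parse_ndjson(stream):
    """Yield one decoded record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(parse_ndjson(sample))
```

The per-line framing is what lets the exporter stream and the instance ingest incrementally without loading the whole file.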

Limitations of version_format v2

Initial sync

  • Data provides deltas to instances that have synced, but it's more of a data partition than a true delta mechanism.

Schema updates

  • The cvss_v4 discussion has shown the problem with baking an assumed schema structure into the format.

Considered Approaches

  1. JSON format in bucket similar to v2
  2. SQLite - TBC
  3. ORAS container - TBC
  4. WAL format - TBC

Evaluated: Approach #1 (closed) - Improved json format

Definition

Improving on the current approach is promising for three main reasons: it streams updates from primary sources, it keeps a structured data approach, and we control the exporter. The improvements involve keeping the full dataset in the bucket along with a limited number of deltas, using a flexible but "removals restricted" schema, and taking advantage of the asynchronous nature of PackageMetadata sync by compressing the data aggressively.

Detail

  • Keep the delta method to help instances upsert, but limit it to the last few slices.
  • Always allow instance to fully re-sync by also keeping the full compacted data set with the deltas.
  • Use a manifest to describe what's in the data and allow instances to selectively pull files rather than listing contents.
  • Schema is "additional properties only". New attributes are immediately reflected in the delta.
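The manifest-driven selective pull described above can be sketched as follows. The manifest layout, file paths, and sequence numbers are assumptions for illustration, not the actual v3 format:

```python
import json

# Hypothetical manifest.json: one full compacted export plus a short tail
# of deltas, each stamped with a monotonically increasing sequence number.
manifest = json.loads("""
{
  "schema_version": "v3",
  "full_export": {"path": "full/00012.ndjson.gz", "sequence": 12},
  "deltas": [
    {"path": "deltas/00013.ndjson.gz", "sequence": 13},
    {"path": "deltas/00014.ndjson.gz", "sequence": 14},
    {"path": "deltas/00015.ndjson.gz", "sequence": 15}
  ]
}
""")

def files_to_pull(manifest, last_synced_sequence):
    """Pick only the files the instance needs, without a bucket List call.

    If the instance is too far behind to be covered by the retained deltas
    (or has never synced), fall back to the full export plus every delta
    newer than it."""
    deltas = sorted(manifest["deltas"], key=lambda d: d["sequence"])
    oldest = deltas[0]["sequence"] if deltas else None
    if last_synced_sequence is not None and oldest is not None \
            and last_synced_sequence >= oldest - 1:
        return [d["path"] for d in deltas if d["sequence"] > last_synced_sequence]
    full = manifest["full_export"]
    return [full["path"]] + [d["path"] for d in deltas if d["sequence"] > full["sequence"]]
```

Reading one small manifest replaces the List operation entirely, which is what removes the compaction-related List cost from the instance sync path.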

Conclusions

For a full conclusion on the v3 format, please refer to the ADR document.

Solutions compared (Description / Pros / Cons)

CloudFlare

Description: manifest + deltas + full_export all in a CF R2 bucket

Pros:
  • No egress costs at all
  • One big manifest.json with many timestamps
  • Simple

Cons:
  • No labels support (but we don't need them anyway)
  • Less familiar to the team
  • Rest of the PMDB infra remains in GCP (for now; eventually we migrate)

GCP + External Location

Description:
  • deltas + full_export in GCP
  • manifest.json in an external location (for example a git repo)

Pros:
  • No egress costs for manifest.json
  • manifest.json can be public (no sensitive data)

Cons:
  • Fragmentation - maintain both GCP and the external location
  • External resource availability risk (low risk, but it might become difficult to calculate SLAs in the future)

GCP with ETag Caching

Description:
  • Instances store the manifest's ETag and issue GET with If-None-Match
  • Server returns 304 Not Modified if unchanged

Pros:
  • RESTful and standard HTTP
  • CDN-cacheable
  • Transferable to other systems
  • Minimal egress when the manifest is unchanged

Cons:
  • Still pays egress on the first fetch and when manifest.json changes

GCP with Labels

Description: bucket labels + manifest.json + deltas + full_export all in GCP

Pros:
  • Reduces manifest.json fetches

Cons:
  • Doesn't eliminate egress
  • Adds complexity (labels + manifest + fallback logic)
  • Not portable to other platforms without bucket labels
  • If we add more PURL types in the future, we might hit the max labels limit (unlikely)
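The ETag caching flow from the comparison above can be sketched as follows. The `fetch` callable is an injected HTTP client stub standing in for a real GCS request, and the ETag values are hypothetical:

```python
# Instance side: remember the manifest's ETag, send If-None-Match, and
# reuse the cached copy on 304 Not Modified, so egress is only paid when
# manifest.json actually changes.
def sync_manifest(fetch, cached_etag, cached_body):
    headers = {"If-None-Match": cached_etag} if cached_etag else {}
    status, etag, body = fetch(headers)
    if status == 304:  # Not Modified: keep what we already have
        return cached_etag, cached_body, False
    return etag, body, True

# Stub server: the manifest is unchanged, so a matching ETag yields 304.
def fake_fetch(headers):
    current_etag = '"abc123"'
    if headers.get("If-None-Match") == current_etag:
        return 304, current_etag, None
    return 200, current_etag, '{"deltas": []}'

etag, body, changed = sync_manifest(fake_fetch, None, None)          # first fetch pays egress
etag2, body2, changed2 = sync_manifest(fake_fetch, etag, body)       # subsequent fetch is a 304
```

Because this is plain conditional HTTP, the same client logic works unchanged against any storage backend or CDN in front of it, which is the "transferable to other systems" point above.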
Edited by Nick Ilieskou