Spike: PackageMetadata sync data format v3

Note

This is still in draft

Problem to Solve

PackageMetadata sync is integral to SCA capabilities for license identification, vulnerability scanning, and vulnerability prioritization (e.g. CVSS vectors). Even though it has worked fairly reliably over the last few major milestones, we have repeatedly hit limitations that are hard to fix within the current data format.

Criteria for PackageMetadata Synchronization

Some of the criteria here are not technically difficult on their own, but they face limitations driven by the unique needs of GitLab instances: the Rails backend, the ACID data model, and the variety of instance installation types.

  • Use a data-at-rest model: package_metadata lives in the GitLab PostgreSQL database, ingestion uses an upsert mechanism, and re-upserting takes a long time
  • Initial DB import time increases
  • Costs of bucket List operations - compaction
  • JSON schema versioning

Data Format Limitations

v1

Used an additive "everything in the bucket" model. Streamed updates directly to the instance database and contained no mechanism for compressing version ranges. It quickly grew to an unmanageable size.

v2

  • Added advisories as a data source.
  • Used range compression for licenses.
  • Switched to ndjson as a format.
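As a sketch of what a v2-style ndjson payload might look like, the snippet below parses one JSON object per line, with licenses attached to a compressed version range rather than one row per version. The field names and values are illustrative assumptions, not the real schema:

```python
import io
import json

# Hypothetical ndjson body: one record per line, licenses mapped to a
# version range ["lowest", "highest"] instead of one entry per version.
sample = io.StringIO(
    '{"purl_type": "npm", "name": "left-pad", "versions": ["1.0.0", "1.3.0"], "licenses": ["MIT"]}\n'
    '{"purl_type": "npm", "name": "lodash", "versions": ["4.17.0", "4.17.21"], "licenses": ["MIT"]}\n'
)

def parse_ndjson(stream):
    """Yield one decoded record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(parse_ndjson(sample))
```

The per-line framing is what lets the exporter stream and the instance ingest incrementally without loading the whole file.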

Limitations of version_format v2

Initial sync

  • Data provides deltas to instances that have synced, but it's more of a data partition than a true delta mechanism.

Schema updates

  • The cvss_v4 discussion has shown the problem with baking an assumed schema structure into the format.

Considered Approaches

  1. JSON format in bucket similar to v2
  2. SQLite - TBC
  3. ORAS container - TBC
  4. WAL format - TBC

Evaluated: Approach #1 (closed) - Improved json format

Definition

Improving on the current approach is promising for three main reasons: it streams updates from primary sources, it keeps a structured data approach, and we control the exporter. The improvements involve keeping the full dataset in the bucket along with a limited number of deltas, using a flexible but "removals restricted" schema, and taking advantage of the asynchronous nature of PackageMetadata sync by compressing the data aggressively.

Detail

  • Keep the delta method to help instances upsert, but limit it to the last few slices.
  • Always allow instance to fully re-sync by also keeping the full compacted data set with the deltas.
  • Use a manifest to describe what's in the data and allow instances to selectively pull files rather than listing contents.
  • Schema is "additional properties only". New attributes are immediately reflected in the delta.
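The manifest-driven selective pull described above can be sketched as follows. The manifest layout, file paths, and sequence numbers are assumptions for illustration, not the actual v3 format:

```python
import json

# Hypothetical manifest.json: one full compacted export plus a short tail
# of deltas, each stamped with a monotonically increasing sequence number.
manifest = json.loads("""
{
  "schema_version": "v3",
  "full_export": {"path": "full/00012.ndjson.gz", "sequence": 12},
  "deltas": [
    {"path": "deltas/00013.ndjson.gz", "sequence": 13},
    {"path": "deltas/00014.ndjson.gz", "sequence": 14},
    {"path": "deltas/00015.ndjson.gz", "sequence": 15}
  ]
}
""")

def files_to_pull(manifest, last_synced_sequence):
    """Pick only the files the instance needs, without a bucket List call.

    If the instance is too far behind to be covered by the retained deltas
    (or has never synced), fall back to the full export plus every delta
    newer than it."""
    deltas = sorted(manifest["deltas"], key=lambda d: d["sequence"])
    oldest = deltas[0]["sequence"] if deltas else None
    if last_synced_sequence is not None and oldest is not None \
            and last_synced_sequence >= oldest - 1:
        return [d["path"] for d in deltas if d["sequence"] > last_synced_sequence]
    full = manifest["full_export"]
    return [full["path"]] + [d["path"] for d in deltas if d["sequence"] > full["sequence"]]
```

Reading one small manifest replaces the List operation entirely, which is what removes the compaction-related List cost from the instance sync path.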

Conclusions

For a full conclusion on the v3 format, please refer to the ADR document.

Solutions compared (Description / Pros / Cons)

CloudFlare

Description: manifest + deltas + full_export all in a CF R2 bucket

Pros:
  • No egress costs at all
  • One big manifest.json with many timestamps
  • Simple

Cons:
  • No labels support (but we don't need them anyway)
  • Less familiar to the team
  • Rest of the PMDB infra remains in GCP (for now; eventually we migrate)

GCP + External Location

Description:
  • deltas + full_export in GCP
  • manifest.json in an external location (for example a git repo)

Pros:
  • No egress costs for manifest.json
  • manifest.json can be public (no sensitive data)

Cons:
  • Fragmentation - maintain both GCP and the external location
  • External resource availability risk (low risk, but it might become difficult to calculate SLAs in the future)

GCP with ETag Caching

Description:
  • Instances store the manifest's ETag and issue GET with If-None-Match
  • Server returns 304 Not Modified if unchanged

Pros:
  • RESTful and standard HTTP
  • CDN-cacheable
  • Transferable to other systems
  • Minimal egress when the manifest is unchanged

Cons:
  • Still pays egress on the first fetch and when manifest.json changes

GCP with Labels

Description: bucket labels + manifest.json + deltas + full_export all in GCP

Pros:
  • Reduces manifest.json fetches

Cons:
  • Doesn't eliminate egress
  • Adds complexity (labels + manifest + fallback logic)
  • Not portable to other platforms without bucket labels
  • If we add more PURL types in the future, we might hit the max labels limit (unlikely)
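The ETag caching flow from the comparison above can be sketched as follows. The `fetch` callable is an injected HTTP client stub standing in for a real GCS request, and the ETag values are hypothetical:

```python
# Instance side: remember the manifest's ETag, send If-None-Match, and
# reuse the cached copy on 304 Not Modified, so egress is only paid when
# manifest.json actually changes.
def sync_manifest(fetch, cached_etag, cached_body):
    headers = {"If-None-Match": cached_etag} if cached_etag else {}
    status, etag, body = fetch(headers)
    if status == 304:  # Not Modified: keep what we already have
        return cached_etag, cached_body, False
    return etag, body, True

# Stub server: the manifest is unchanged, so a matching ETag yields 304.
def fake_fetch(headers):
    current_etag = '"abc123"'
    if headers.get("If-None-Match") == current_etag:
        return 304, current_etag, None
    return 200, current_etag, '{"deltas": []}'

etag, body, changed = sync_manifest(fake_fetch, None, None)          # first fetch pays egress
etag2, body2, changed2 = sync_manifest(fake_fetch, etag, body)       # subsequent fetch is a 304
```

Because this is plain conditional HTTP, the same client logic works unchanged against any storage backend or CDN in front of it, which is the "transferable to other systems" point above.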
Edited by Nick Ilieskou