# Spike: PackageMetadata sync data format v3
> [!note]
>
> This is still in draft.

## Problem to Solve

PackageMetadata sync is integral to SCA capabilities such as license identification, vulnerability scanning, and vulnerability prioritization (e.g. via CVSS vectors). Even though it has worked fairly reliably over the last few major milestones, we have repeatedly hit limitations that we struggle to fix using the current data format.

### Criteria for PackageMetadata Synchronization

Some of the criteria here are not technically difficult on their own, but they are constrained by the unique needs of GitLab instances: the Rails backend, the ACID data model, and the variety of supported instance installation types.

- Data-at-rest model: package_metadata lives in the GitLab PostgreSQL database, ingestion uses an upsert mechanism, and re-upserting takes a long time.
- Initial DB import time keeps increasing.
- Cost of bucket use for List operations.
- Compaction.
- JSON schema versioning.

### Data Format Limitations

#### v1

Used an additive "everything in the bucket" model. Streamed updates directly into the instance database and provided no way to compress version ranges. It quickly grew to an unmanageable size.

#### v2

- Added advisories as a data source.
- Used range compression for licenses.
- Switched to ndjson as the file format.

### Limitations of version_format v2

#### Initial sync

- The data provides deltas to instances that have already synced, but it is really a data partition rather than a true delta mechanism.

#### Schema updates

- The cvss_v4 discussion has shown the problem with an approach that assumes a fixed schema structure.

### Considered approaches

1. JSON format in a bucket, similar to v2
2. SQLite - TBC
3. ORAS container - TBC
4. WAL format - TBC

### Evaluated: Approach #1 - Improved JSON format

#### Definition

Improving on the current approach is promising, mainly because it keeps streaming updates from the primary sources, retains a structured data approach, and keeps full control over the exporter. The improvements involve keeping the full dataset in the bucket together with a limited number of deltas, using a flexible but "removals restricted" schema, and taking advantage of the asynchronous nature of PackageMetadata sync by compressing the data aggressively.

#### Detail

- Keep the delta method to help instances upsert, but limit it to the last few slices.
- Always allow an instance to fully re-sync by keeping the full compacted dataset alongside the deltas.
- Use a manifest to describe what is in the bucket, so instances can selectively pull files rather than listing its contents (a sketch of this flow follows the Conclusions section).
- The schema is "additional properties only". New attributes are immediately reflected in the delta.

## Conclusions

For the full conclusion on the v3 format, please refer to the [ADR](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/18229) document.
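To make the proposal above more concrete, the following is a minimal, instance-side sketch of the manifest-plus-deltas flow described in the Detail section. The bucket URL, file names (`manifest.json`, `deltas/<sequence>.ndjson.gz`, `full/export.ndjson.gz`), manifest fields, and sequence bookkeeping are all assumptions made for illustration only; they are not the final v3 format (see the ADR for that).

```ruby
# Minimal sketch of an instance-side v3 sync loop. All file names, manifest
# fields, and constants below are hypothetical; they only illustrate the
# "manifest + limited deltas + full export" idea, not the final format.
require "net/http"
require "json"
require "zlib"
require "stringio"

BUCKET_URL = "https://storage.googleapis.com/example-package-metadata" # hypothetical

def fetch(path)
  Net::HTTP.get(URI("#{BUCKET_URL}/#{path}"))
end

def each_ndjson_row(gzipped_body)
  Zlib::GzipReader.new(StringIO.new(gzipped_body)).each_line do |line|
    yield JSON.parse(line)
  end
end

# The manifest describes the bucket contents so instances never need a
# List operation. Assumed shape:
#   { "schema_version": 3,
#     "full_export": { "path": "full/export.ndjson.gz", "sequence": 1700 },
#     "deltas": [ { "path": "deltas/1701.ndjson.gz", "sequence": 1701 }, ... ] }
manifest = JSON.parse(fetch("manifest.json"))

last_synced_sequence = 1698 # would come from the instance's sync checkpoint

pending_deltas = manifest["deltas"]
                   .select { |d| d["sequence"] > last_synced_sequence }
                   .sort_by { |d| d["sequence"] }
oldest_delta = manifest["deltas"].map { |d| d["sequence"] }.min

files =
  if oldest_delta && last_synced_sequence >= oldest_delta - 1
    # The gap is covered by the retained deltas: apply only what's new.
    pending_deltas.map { |d| d["path"] }
  else
    # The instance is too far behind (or brand new): fall back to the
    # full compacted export that is always kept next to the deltas.
    [manifest["full_export"]["path"]]
  end

files.each do |path|
  each_ndjson_row(fetch(path)) do |row|
    # Upsert each row into the package_metadata tables; replaying the same
    # slice stays safe because ingestion is idempotent (upsert semantics).
    puts row.inspect # placeholder for the real ingestion step
  end
end
```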
### Solutions regarding storing the data (not strictly related to v3)

<table>
  <tr>
    <th>Solution</th>
    <th>Description</th>
    <th>Pros</th>
    <th>Cons</th>
  </tr>
  <tr>
    <td>**Cloudflare**</td>
    <td>manifest + deltas + full_export all in a CF R2 bucket</td>
    <td>
      * No egress costs at all
      * One big manifest.json with many timestamps
      * Simple
    </td>
    <td>
      * No labels support (but we don't need them anyway)
      * Less familiar to the team
      * Rest of PMDB infra remains in GCP (for now; eventually we would migrate)
    </td>
  </tr>
  <tr>
    <td>**GCP + External Location**</td>
    <td>
      * deltas + full_export in GCP
      * manifest.json in an external location (for example a git repo)
    </td>
    <td>
      * No egress costs for manifest.json
      * manifest.json can be public (no sensitive data)
    </td>
    <td>
      * Fragmentation: maintain both GCP and the external location
      * External resource availability risk (low risk, but it might make SLAs harder to calculate in the future)
    </td>
  </tr>
  <tr>
    <td>**GCP with ETag Caching**</td>
    <td>
      * Instances store the manifest's ETag and issue GET with If-None-Match
      * The server returns 304 Not Modified if unchanged
      (see the sketch after this table)
    </td>
    <td>
      * RESTful and standard HTTP
      * CDN-cacheable
      * Transferable to other systems
      * Minimal egress when the manifest is unchanged
    </td>
    <td>Still pays egress on the first fetch and whenever manifest.json changes</td>
  </tr>
  <tr>
    <td>**GCP with Labels**</td>
    <td>bucket labels + manifest.json + deltas + full_export all in GCP</td>
    <td>
      * Reduces manifest.json fetches
    </td>
    <td>
      * Doesn't eliminate egress
      * Adds complexity (labels + manifest + fallback logic)
      * Not portable to other platforms without bucket labels
      * If we add more PURL types in the future, we might hit the maximum labels limit (unlikely)
    </td>
  </tr>
</table>
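The ETag option in the table above relies only on standard HTTP conditional requests. A minimal sketch of the instance side follows, reusing the same hypothetical manifest URL as the earlier sketch; the cache handling is illustrative, not the actual sync implementation.

```ruby
# Minimal sketch of the "GCP with ETag Caching" option: the instance stores the
# manifest's ETag and sends it back as If-None-Match, so an unchanged manifest
# costs a 304 response instead of a full download. URL and caching details are
# hypothetical.
require "net/http"

MANIFEST_URI = URI("https://storage.googleapis.com/example-package-metadata/manifest.json")

def fetch_manifest(cached_etag: nil, cached_body: nil)
  request = Net::HTTP::Get.new(MANIFEST_URI)
  request["If-None-Match"] = cached_etag if cached_etag

  response = Net::HTTP.start(MANIFEST_URI.host, MANIFEST_URI.port, use_ssl: true) do |http|
    http.request(request)
  end

  case response
  when Net::HTTPNotModified
    # 304: the manifest has not changed since we last saw this ETag,
    # so no body was transferred for it.
    [cached_etag, cached_body]
  when Net::HTTPSuccess
    # 200: a new manifest; remember its ETag for the next sync run.
    [response["ETag"], response.body]
  else
    raise "unexpected response: #{response.code}"
  end
end

etag, body = fetch_manifest
# ...a later sync run reuses the stored values:
etag, body = fetch_manifest(cached_etag: etag, cached_body: body)
```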