Read bucket labels to skip listObjects in package metadata sync
Background
The package metadata sync runs as a Sidekiq cron job (PackageMetadata::LicensesSyncWorker, PackageMetadata::AdvisoriesSyncWorker). Each run iterates over all permitted purl types (maven, npm, pypi, composer, etc.), and for each one the GCP connector calls data_after(checkpoint) to determine what to sync. Previously, this always called listObjects to enumerate every file under the prefix (e.g. v2/packagist/), then walked through them to find the checkpoint position. For registries with thousands of files, this is expensive and slow.
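To make the cost of the previous path concrete, here is a minimal sketch of a listing-based lookup. It is not the connector's actual code: the object name layout (prefix/timestamp/chunk), the helper timestamp_of, and the method name data_after_via_listing are all assumptions made for illustration. The point is that every object name under the prefix has to be fetched and ordered before the checkpoint position is known.

```ruby
# Sketch only. Assumes object names look like "v2/<registry>/<timestamp>/<chunk>.ndjson";
# the real bucket layout may differ.
def timestamp_of(name)
  # Third path segment is assumed to be an integer timestamp; nil if it is not.
  Integer(name.split("/")[2].to_s, exception: false)
end

# Hypothetical stand-in for the listing-based lookup: takes the full listing
# (the expensive part) and returns only the objects newer than the checkpoint.
def data_after_via_listing(object_names, checkpoint_ts)
  ordered = object_names
    .select { |name| timestamp_of(name) }
    .sort_by { |name| [timestamp_of(name), name] }
  return ordered if checkpoint_ts.nil?

  # Walk past everything at or before the checkpoint.
  ordered.drop_while { |name| timestamp_of(name) <= checkpoint_ts }
end
```

Even when nothing new exists, this approach still needs the complete listing as input, which is the overhead this MR avoids.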
What does this MR do and why?
This MR reads GCP bucket labels (written by the exporter in license-exporter!220) to determine which deltas are available without calling listObjects. It falls back to the existing listObjects path when labels are absent or insufficient.
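The decision described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the MR's implementation: the label format (a hash whose values are integer timestamps as strings), the names SyncPlan and plan_sync, and the rule that labels are "insufficient" when the checkpoint is older than the oldest label are all hypothetical. The lister argument stands in for the expensive listObjects call and is only invoked on the fallback path.

```ruby
# Sketch only. Label keys, value format, and the "insufficient" rule are assumptions.
SyncPlan = Struct.new(:source, :timestamps)

def newer_than(timestamps, checkpoint_ts)
  checkpoint_ts.nil? ? timestamps : timestamps.select { |ts| ts > checkpoint_ts }
end

# labels:        hash of bucket labels, e.g. { "delta_1" => "200" } (hypothetical shape)
# checkpoint_ts: integer timestamp of the last synced delta, or nil
# lister:        callable that performs the listing and returns all timestamps
def plan_sync(labels, checkpoint_ts, lister)
  label_ts = (labels || {}).values
    .map { |value| Integer(value.to_s, exception: false) }
    .compact
    .sort

  # Labels are usable only if they exist and reach back at least to the checkpoint.
  usable = !label_ts.empty? && !checkpoint_ts.nil? && checkpoint_ts >= label_ts.first

  if usable
    # Returns an empty list when the checkpoint is at or past the newest label.
    SyncPlan.new(:labels, newer_than(label_ts, checkpoint_ts))
  else
    # Absent or insufficient labels: fall back to the listing path.
    SyncPlan.new(:list_objects, newer_than(lister.call.sort, checkpoint_ts))
  end
end
```

In this sketch, a checkpoint at or past the newest label yields an empty plan without ever invoking lister, which mirrors the behavior the validation steps below ask you to observe.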
References
Screenshots or screen recordings
Not applicable: backend-only change.
How to set up and validate locally
- Run the package metadata sync with a v2 purl type that has bucket labels set.
- Observe that when the checkpoint is at or past the newest label timestamp, no listObjects call is made.
- Observe the fallback to listObjects when labels are missing.
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.