Use GCP bucket labels for PURL type timestamps to reduce listObject costs
Summary
Implement GCP bucket labels to store the latest delta timestamps per PURL type, allowing GitLab instances to avoid expensive listObject calls. This approach uses 1 label per PURL type with up to 5 timestamps each.
Goal
Decrease costs of public advisory and license data until we migrate to v3.
Background
This issue is derived from the discussion in #584273 (closed) (note 3081674002).
Currently, our public buckets incur significant costs from listObject operations (Class A operations). By using bucket labels to store the latest timestamps, we can:
- Replace expensive listObject calls with cheaper label reads (Class B operations)
- Significantly reduce egress bytes since label responses are much smaller than listObject responses
Proposal
Label Structure
Each PURL type gets one label containing comma-separated timestamps. Since v2 can have one or more files per timestamp directory, we need to include the number of files (NoF) when there's more than one file:
Format: PURL_TYPE: timestamp:NoF,timestamp:NoF,timestamp:NoF,...
Where NoF (Number of Files) is omitted when it equals 1.
Example:
maven: 1771404692:3,1771404691,1771403692
This maps to:
v2/maven/1771404692/00000001.ndjson
v2/maven/1771404692/00000002.ndjson
v2/maven/1771404692/00000003.ndjson
v2/maven/1771404691/00000001.ndjson
v2/maven/1771403692/00000001.ndjson
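A minimal sketch of how the label value above could be expanded into object paths. The function names are hypothetical, and it assumes files in a timestamp directory are numbered sequentially from 00000001 with 8-digit zero padding, as in the example. The separators are parameters because GCP documents label values as limited to lowercase letters, digits, underscores, and dashes, so `:` and `,` may need to be swapped for allowed characters.

```python
from typing import List, Tuple


def parse_label(value: str, ts_sep: str = ",", count_sep: str = ":") -> List[Tuple[int, int]]:
    """Parse 'ts:NoF,ts,...' into (timestamp, file_count) pairs; NoF defaults to 1."""
    entries = []
    for item in value.split(ts_sep):
        item = item.strip()
        if not item:
            continue
        ts, _, count = item.partition(count_sep)
        entries.append((int(ts), int(count) if count else 1))
    return entries


def object_paths(purl_type: str, value: str, version: str = "v2") -> List[str]:
    """Expand a label value into the object paths it stands for.

    Assumption: file names are sequential, 8-digit, zero-padded numbers.
    """
    return [
        f"{version}/{purl_type}/{ts}/{n:08d}.ndjson"
        for ts, count in parse_label(value)
        for n in range(1, count + 1)
    ]


paths = object_paths("maven", "1771404692:3,1771404691,1771403692")
```

Here `paths` holds the five object names listed in the example, in label order.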
Constraints:
- GCP allows 64 labels per bucket
- Label values can be up to 63 characters
- Currently we have 16 PURL types, so we need 16 labels
- Not all PURL types will have the same number of timestamps - the exporter must check the 63-character limit and remove older timestamps as needed
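To make the 63-character constraint concrete, here is a rough calculation of how many entries fit in one label value. It assumes 10-digit Unix timestamps, a one-character separator, and a one-digit file count; `max_entries` is a hypothetical helper, not part of the exporter.

```python
MAX_LABEL_VALUE_LEN = 63  # per-label value limit listed in the constraints


def max_entries(entry_len: int, sep_len: int = 1, limit: int = MAX_LABEL_VALUE_LEN) -> int:
    """Largest n such that n * entry_len + (n - 1) * sep_len <= limit."""
    return (limit + sep_len) // (entry_len + sep_len)


# Bare 10-digit timestamps: 5 entries use 5 * 10 + 4 = 54 characters.
bare = max_entries(len("1771404692"))

# Every entry carries a one-digit NoF suffix: 5 entries would need 64 characters.
with_count = max_entries(len("1771404692:3"))
```

Under these assumptions five bare timestamps fit, but only four fit when every entry carries a file count, which is why the number of timestamps varies per PURL type.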
Implementation
Exporter Side
- Update the exporter to set/update bucket labels when new deltas are created
- Include the number of files (:NoF) suffix when a timestamp directory contains more than one file
- Check if adding a new timestamp would exceed the 63-character limit per label
- If the limit is reached, remove the oldest timestamp(s) to make room for the new one
- Update labels atomically to avoid race conditions
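The trimming steps above could look roughly like the following sketch. It only builds the new label value; writing it to the bucket, and doing so atomically, is out of scope here. `add_timestamp` is a hypothetical name, and the sketch assumes entries are kept newest first, as in the maven example.

```python
MAX_LABEL_VALUE_LEN = 63  # per-label value limit


def add_timestamp(current: str, timestamp: int, file_count: int = 1,
                  limit: int = MAX_LABEL_VALUE_LEN) -> str:
    """Prepend a new entry and drop the oldest entries until the value fits."""
    # NoF is omitted when the timestamp directory holds a single file.
    new_entry = str(timestamp) if file_count == 1 else f"{timestamp}:{file_count}"
    entries = [new_entry] + [e for e in current.split(",") if e]
    # Assumption: newest first, so the oldest entry sits at the end.
    while len(entries) > 1 and len(",".join(entries)) > limit:
        entries.pop()
    return ",".join(entries)


updated = add_timestamp("1771404692:3,1771404691,1771403692", 1771405000, file_count=2)
```

With room to spare, the new entry is simply prepended; once the value would exceed 63 characters, the oldest timestamps fall off the end.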
Rails Side
- Read bucket labels first (Class B operation with minimal egress)
- Parse the label format to determine timestamps and file counts
- Use listObject only as a fallback if labels are not available or don't contain the needed timestamp
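A sketch of the read-labels-first decision, covering the three cases of the Sync Flow. All names are hypothetical, and the actual sync code is Rails, so this is illustration only. It assumes the instance tracks the timestamp of the last delta it applied, and that the label can be trusted only when that timestamp is no older than the oldest entry in the label.

```python
from enum import Enum
from typing import List, Optional, Tuple


class SyncAction(Enum):
    FETCH_FROM_LABEL = "fetch_from_label"  # Case 1
    FALLBACK_LIST = "fallback_list"        # Case 2
    UP_TO_DATE = "up_to_date"              # Case 3


def parse_label(value: str) -> List[Tuple[int, int]]:
    """Parse 'ts:NoF,ts,...' into (timestamp, file_count) pairs; NoF defaults to 1."""
    entries = []
    for item in value.split(","):
        if item:
            ts, _, count = item.partition(":")
            entries.append((int(ts), int(count) if count else 1))
    return entries


def plan_sync(label_value: Optional[str],
              last_synced: Optional[int]) -> Tuple[SyncAction, List[Tuple[int, int]]]:
    """Decide what to do from the label alone; returns the action and pending deltas."""
    entries = sorted(parse_label(label_value or ""))  # oldest first
    if not entries or last_synced is None:
        return SyncAction.FALLBACK_LIST, []  # no label, or nothing synced yet
    oldest, newest = entries[0][0], entries[-1][0]
    if last_synced >= newest:
        return SyncAction.UP_TO_DATE, []
    if last_synced < oldest:
        # Deltas older than the label may exist, so the label is not enough.
        return SyncAction.FALLBACK_LIST, []
    pending = [entry for entry in entries if entry[0] > last_synced]
    return SyncAction.FETCH_FROM_LABEL, pending


action, pending = plan_sync("1771404692:3,1771404691,1771403692", 1771404691)
```

In this call the instance is one delta behind, so the plan is to fetch the three files of 1771404692 without any listObject call.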
Sync Flow
sequenceDiagram
participant GL as GitLab Instance
participant GCP as GCP Bucket
GL->>GCP: Read bucket labels
GCP-->>GL: Return labels (latest delta info)
alt Case 1: Target delta found in label
GL->>GCP: Fetch delta file(s)
GCP-->>GL: Return delta
GL->>GL: Apply delta
else Case 2: Target delta older than deltas in the label
GL->>GCP: listObject (fallback)
GCP-->>GL: Return file list
GL->>GL: Determine correct delta
GL->>GCP: Fetch delta file
GCP-->>GL: Return delta
GL->>GL: Apply delta
else Case 3: Instance already synced latest delta
GL->>GL: Instance fully up to date (no action)
end
Cost Impact
Reading labels is a Class B operation with minimal egress bytes (~959 bytes for all 16 PURL types with 3 timestamps each). This is significantly cheaper than the current listObject approach, which generates large response payloads.
The Ruby gem we use supports this endpoint, and it requires only the Storage Legacy Bucket Reader role, which all our buckets already have.
Tasks
- Exporter: Update to write bucket labels with latest timestamps per PURL type (including file count when > 1)
- Exporter: Implement 63-character limit check and remove oldest timestamps when limit is reached
- Rails: Update sync mechanism to read labels first, parse the timestamp:NoF format, and fall back to listObject only when necessary
- Clean up existing unrelated labels from buckets to ensure only PURL-related labels exist
Trade-offs
Pros:
- Low cost: Reading labels is a Class B operation
- Can read all labels in one operation
- Reduces egress bytes significantly
- Can be deployed now on v2 to see cost reduction
Cons:
- Vendor lock-in: Not all providers support labels (e.g., Cloudflare doesn't)
- Limited by 64 labels per bucket (not a real concern given current PURL type count)
- Still consumes some egress bytes (but much less than listObject)
- Variable number of timestamps per PURL type due to 63-character limit
Related Issues
- #584273 (closed) - Spike: PackageMetadata sync data format v3
- #584273 (comment 3081674002)