Use GCP bucket labels for PURL type timestamps to reduce listObject costs

Summary

Store the latest delta timestamps per PURL type in GCP bucket labels so that GitLab instances can avoid expensive listObject calls. This approach uses one label per PURL type, each holding up to 5 timestamps (fewer when entries carry a file-count suffix).

Goal

Reduce the cost of serving public advisory and license data until we migrate to v3.

Background

This issue is derived from the discussion in #584273 (closed) (note 3081674002).

Currently, our public buckets incur significant costs from listObject operations (Class A operations). By using bucket labels to store the latest timestamps, we can:

  • Replace expensive listObject calls with cheaper label reads (Class B operations)
  • Significantly reduce egress bytes since label responses are much smaller than listObject responses

Proposal

Label Structure

Each PURL type gets one label containing comma-separated timestamps. Since v2 can have one or more files per timestamp directory, we need to include the number of files (NoF) when there's more than one file:

Format: PURL_TYPE: timestamp:NoF,timestamp:NoF,timestamp:NoF,...

Where NoF (Number of Files) is omitted when it equals 1.

Example:

maven: 1771404692:3,1771404691,1771403692

This maps to:

  • v2/maven/1771404692/00000001.ndjson
  • v2/maven/1771404692/00000002.ndjson
  • v2/maven/1771404692/00000003.ndjson
  • v2/maven/1771404691/00000001.ndjson
  • v2/maven/1771403692/00000001.ndjson
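To make the mapping above concrete, here is a minimal parsing sketch in Python (chosen only for illustration; the real consumer is the Rails side). The function names `parse_label_value` and `object_paths` are hypothetical, and the 8-digit, 1-based file naming is inferred from the example paths. The `:` and `,` separators are taken verbatim from the format above; whether GCP accepts them in label values should be verified before implementation.

```python
from typing import List, Tuple


def parse_label_value(value: str) -> List[Tuple[int, int]]:
    """Parse 'ts:NoF,ts,...' into (timestamp, file_count) pairs."""
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        timestamp, _, count = item.partition(":")
        # NoF is omitted when the directory holds exactly one file.
        entries.append((int(timestamp), int(count) if count else 1))
    return entries


def object_paths(purl_type: str, value: str) -> List[str]:
    """Expand a label value into the v2 object paths it refers to."""
    paths = []
    for timestamp, count in parse_label_value(value):
        for index in range(1, count + 1):
            # Assumption: files are named with 8-digit, 1-based sequence numbers.
            paths.append(f"v2/{purl_type}/{timestamp}/{index:08d}.ndjson")
    return paths


if __name__ == "__main__":
    for path in object_paths("maven", "1771404692:3,1771404691,1771403692"):
        print(path)
```

Run against the maven example, this prints the five paths listed above in the same order.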

Constraints:

  • GCP allows 64 labels per bucket
  • Label values can be up to 63 characters
  • Currently we have 16 PURL types, so we need 16 labels
  • Not all PURL types will have the same number of timestamps: entries with a :NoF suffix are longer, so the exporter must check the 63-character limit and remove older timestamps as needed
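As a rough check of how many timestamps fit in one label, the arithmetic can be sketched as follows. It assumes 10-digit Unix timestamps, as in the example, and a single comma between entries; `max_timestamps` is a hypothetical helper, not existing exporter code.

```python
MAX_LABEL_VALUE_LENGTH = 63  # per-label value limit quoted above


def max_timestamps(timestamp_digits: int = 10, suffix_chars: int = 0) -> int:
    """Largest n such that n entries plus (n - 1) commas fit in the limit."""
    entry = timestamp_digits + suffix_chars
    # n * entry + (n - 1) <= limit  =>  n <= (limit + 1) / (entry + 1)
    return (MAX_LABEL_VALUE_LENGTH + 1) // (entry + 1)


if __name__ == "__main__":
    print(max_timestamps())                # plain timestamps
    print(max_timestamps(suffix_chars=2))  # every entry carries e.g. ":3"
```

Under these assumptions a label holds 5 plain timestamps (54 characters), but only 4 when every entry carries a two-character suffix such as `:3`, which is why the count varies per PURL type.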

Implementation

Exporter Side

  • Update the exporter to set/update bucket labels when new deltas are created
  • Include the number of files (:NoF) suffix when a timestamp directory contains more than one file
  • Check if adding a new timestamp would exceed the 63-character limit per label
  • If the limit is reached, remove the oldest timestamp(s) to make room for the new one
  • Update labels atomically to avoid race conditions
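The trimming steps above could look roughly like the sketch below. It assumes the newest timestamp comes first in the label value (as in the maven example), and `add_delta` and `format_entry` are hypothetical names. The label write itself is not shown; for the atomic update, one option worth checking is a metageneration precondition on the bucket update.

```python
MAX_LABEL_VALUE_LENGTH = 63  # per-label value limit from the constraints


def format_entry(timestamp: int, file_count: int) -> str:
    """Render one entry, omitting the :NoF suffix for single-file deltas."""
    return str(timestamp) if file_count == 1 else f"{timestamp}:{file_count}"


def add_delta(value: str, timestamp: int, file_count: int = 1) -> str:
    """Prepend a new delta and drop the oldest entries until the value fits."""
    existing = [entry for entry in value.split(",") if entry]
    entries = [format_entry(timestamp, file_count)] + existing
    # Assumption: entries are ordered newest first, so the oldest is last.
    while len(entries) > 1 and len(",".join(entries)) > MAX_LABEL_VALUE_LENGTH:
        entries.pop()
    return ",".join(entries)


if __name__ == "__main__":
    current = "1000000005,1000000004,1000000003,1000000002,1000000001"
    print(add_delta(current, 1000000006, file_count=3))
```

In this usage example the label already holds five plain timestamps, so adding a three-file delta pushes the value past 63 characters and the oldest entry (`1000000001`) is removed.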

Rails Side

  • Read bucket labels first (Class B operation with minimal egress)
  • Parse the label format to determine timestamps and file counts
  • Use listObject only as a fallback if labels are not available or don't contain the needed timestamp

Sync Flow

sequenceDiagram
    participant GL as GitLab Instance
    participant GCP as GCP Bucket

    GL->>GCP: Read bucket labels
    GCP-->>GL: Return labels (latest delta info)

    alt Case 1: Target delta found in label
        GL->>GCP: Fetch delta file(s)
        GCP-->>GL: Return delta
        GL->>GL: Apply delta
    else Case 2: Target delta older than deltas in the label
        GL->>GCP: listObject (fallback)
        GCP-->>GL: Return file list
        GL->>GL: Determine correct delta
        GL->>GCP: Fetch delta file(s)
        GCP-->>GL: Return delta
        GL->>GL: Apply delta
    else Case 3: Instance already synced latest delta
        GL->>GL: Instance fully up to date (no action)
    end
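
The three cases in the diagram can be sketched as a small decision function. This is one interpretation, not the agreed design: it assumes the instance tracks the timestamp of the last delta it applied (`last_synced`), and that the label is trustworthy only when `last_synced` is at least as new as the oldest timestamp in the label, because older deltas may have been trimmed. `SyncAction` and `decide` are hypothetical names.

```python
from enum import Enum
from typing import List, Optional, Tuple


class SyncAction(Enum):
    FETCH_FROM_LABEL = "fetch_from_label"  # Case 1
    FALLBACK_TO_LIST = "fallback_to_list"  # Case 2
    UP_TO_DATE = "up_to_date"              # Case 3


def decide(
    label_timestamps: List[int], last_synced: Optional[int]
) -> Tuple[SyncAction, List[int]]:
    """Return the action and, for Case 1, the timestamps to fetch (oldest first)."""
    # Missing labels or a never-synced instance: the label alone is not enough.
    if not label_timestamps or last_synced is None:
        return SyncAction.FALLBACK_TO_LIST, []
    if last_synced >= max(label_timestamps):
        return SyncAction.UP_TO_DATE, []
    if last_synced >= min(label_timestamps):
        pending = sorted(t for t in label_timestamps if t > last_synced)
        return SyncAction.FETCH_FROM_LABEL, pending
    # last_synced predates the label window, so deltas may be missing from it.
    return SyncAction.FALLBACK_TO_LIST, []


if __name__ == "__main__":
    labels = [1771404692, 1771404691, 1771403692]
    print(decide(labels, 1771404691))
    print(decide(labels, 1771404692))
    print(decide(labels, 1771400000))
```

With the maven example, an instance last synced at 1771404691 fetches only 1771404692 (Case 1), one synced at 1771404692 does nothing (Case 3), and one synced at 1771400000 falls back to listObject (Case 2).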

Cost Impact

Reading labels is a Class B operation with minimal egress (~959 bytes for all 16 PURL types with 3 timestamps each). This is significantly cheaper than the current listObject approach, which is a Class A operation and generates large response payloads.

The Ruby gem we use already supports this endpoint, and reading labels requires only the Storage Legacy Bucket Reader role, which all our buckets already grant.

Tasks

  • Exporter: Update to write bucket labels with latest timestamps per PURL type (including file count when > 1)
  • Exporter: Implement 63-character limit check and remove oldest timestamps when limit is reached
  • Rails: Update the sync mechanism to read labels first, parse the timestamp:NoF format, and fall back to listObject only when necessary
  • Clean up existing unrelated labels from buckets to ensure only PURL-related labels exist

Trade-offs

Pros:

  • Low cost: Reading labels is a Class B operation
  • Can read all labels in one operation
  • Reduces egress bytes significantly
  • Can be deployed now on v2 to see cost reduction

Cons:

  • Vendor lock-in: Not all providers support labels (e.g., Cloudflare doesn't)
  • Limited by 64 labels per bucket (not a real concern given current PURL type count)
  • Still consumes some egress bytes (but much less than listObject)
  • Variable number of timestamps per PURL type due to 63-character limit

/cc @ifrenkel @nilieskou @onaaman

Edited by Ahmad Zaydan