[PROMOTED] Sync Rails backend with License DB

Problem to solve

GitLab instances need to stay in sync with the License Database, which is exported on a regular basis.

Proposal

Check exports of the License DB periodically, and insert new license data into the DB. See Export License DB (#373030 - closed).

For now, the instance's needs are met with the simplest possible method.

Method:

  1. use a GCP bucket to synchronize data for non-offline instances
  2. use CSV as the format, where each row describes the licenses for a particular package version
  3. use a unique identifier to denote position in sequence, version of format used, and package type being imported
    • callers can use the sequence identifier to skip data already stored, because new sequence identifiers are strictly increasing integers
  4. for offline mode, create a rake task to trigger synchronization against a local file on disk
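As an illustrative sketch of the format-v1 row described above (the column order and the `|` separator for multiple SPDX identifiers are assumptions, not the finalized export format), a row could be parsed like this:

```ruby
require "csv"

# Hypothetical row layout: package_name, version, spdx_identifiers.
# The pipe-delimited encoding of multiple licenses is an assumption.
Row = Struct.new(:package_name, :version, :spdx_identifiers)

def parse_row(line)
  name, version, licenses = CSV.parse_line(line)
  Row.new(name, version, licenses.split("|"))
end
```

A caller iterating a chunk file would invoke `parse_row` once per line to obtain the tuples the sync service consumes.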

Implementation plan

The plan has been broken into sub-issues by functionality:

(old) Implementation plan

This implementation plan has been superseded by the issues above. It has not been removed because there are discussions linking to it.

  • background worker
    • scheduled background job that triggers the sync
  • sync service
    • retrieve last sync position (see Store sync position section below)
    • use connector (see Connectors section below) and pass purl-type, sequence-id, chunk-id
    • using stream yielded by connector
      • iterate over [package_name, version, spdx_identifiers] tuples
      • convert these into batches (e.g. PackageMetadata::Batch) which implement BulkInsertableTask to update the package metadata models
    • save sync position
  • data import service (bulk insert)
    • save {package, version, license} tuples into the database
    • use BulkInsertableTask
    • provide the caller with a callback when the save has occurred
  • data connectors
    • gcp bucket Add GCP bucket connector for fetching package m... (#383797 - closed)
      • responsible for establishing a connection to the GCP bucket
      • can seek to sequence/chunk within bucket
      • opens a CSV stream and extracts tuples
    • offline storage (tbd)
      • responsible for opening compressed package metadata file
      • can seek to sequence/chunk within directory stored in file
      • opens a CSV stream and extracts tuples
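An illustrative sketch of the batching step above: the real PackageMetadata::DataBatch would include BulkInsertableTask and persist via insert_all!, while this stand-in collects flushed rows in memory so the accumulation logic can run on its own. The class name, method names, and max_size are assumptions.

```ruby
# Stand-in for PackageMetadata::DataBatch: buffers rows and "bulk inserts"
# (here: appends to an array) whenever the buffer fills.
class DataBatch
  attr_reader :flushed

  def initialize(purl_type, max_size: 1000)
    @purl_type = purl_type
    @max_size = max_size
    @rows = []
    @flushed = []
  end

  # Buffer one row; flush automatically once the batch is full.
  def add(sequence_id, name, version, spdx_identifiers)
    @rows << [@purl_type, sequence_id, name, version, spdx_identifiers]
    insert_all! if @rows.size >= @max_size
  end

  # Stand-in for the bulk insert performed by BulkInsertableTask.
  def insert_all!
    @flushed.concat(@rows)
    @rows = []
  end
end
```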

Connectors

Synchronization will require two types of connections: one to the GCP bucket, and one to a "local" file when the instance is offline. The connectors can therefore expose the same API to callers while implementing different connection types.

Note: format version 1 will use CSV as the data format, so the connector should be able to parse CSV rows and provide a stream of well-formed tuples to the caller.
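As a hedged sketch of that shared API, here is an in-memory connector that responds to the same two messages (position_after and slice_of); the real connectors would stream CSV chunks from the bucket or a local file instead of an array:

```ruby
# In-memory stand-in for a connector. rows is an array of
# [sequence_id, chunk_id, tuple] entries, kept in position order.
class InMemoryConnector
  def initialize(rows)
    @rows = rows.sort_by { |seq, chunk, _| [seq, chunk] }
  end

  # Seek past the stored position; only later rows remain to be yielded.
  def position_after(sequence_id, chunk_id)
    @rows = @rows.drop_while do |seq, chunk, _|
      ([seq, chunk] <=> [sequence_id, chunk_id]) <= 0
    end
    self
  end

  # Yield the remaining tuples to the caller, slice_size rows at a time.
  def slice_of(slice_size)
    @rows.each_slice(slice_size) do |slice|
      slice.each { |seq, _chunk, tuple| yield seq, tuple }
    end
  end
end
```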

Pseudocode illustrating how a caller may use the connectors:

module PackageMetadata
  class SyncService
    BATCH_SIZE = 1000 # illustrative

    def sync(sync_uri)
      position = SyncPosition.find(sync_uri)
      batch = PackageMetadata::DataBatch.new(position.purl_type)
      num_tuples_consumed = 0

      connector_for(position.base_uri, position.format_version, position.purl_type)
        .position_after(position.sequence_id, position.chunk_id)
        .slice_of(slice_size) do |sequence_id, tuple|
          if num_tuples_consumed >= BATCH_SIZE
            batch.insert_all!
            batch = PackageMetadata::DataBatch.new(position.purl_type)
            num_tuples_consumed = 0
          end

          batch.add(sequence_id, tuple.package_name, tuple.package_version, tuple.spdx_identifier)
          num_tuples_consumed += 1
        end

      batch.insert_all! # flush the final partial batch
    end
  end
end

Store sync position

  • introduce PackageMetadata::SyncPosition model with a backing table
    • store a tuple of [purl_type, position identifier, timestamp]
    • position identifier will have the structure: <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>
      • base_uri points to either a file:// or https:// location
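A hedged sketch of splitting that position identifier into its parts. It assumes the last four path segments are always format_version, purl_type, sequence_id, and chunk_id, with everything before them forming base_uri; the struct and helper names are illustrative, not the model's API.

```ruby
# Decompose <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>.
Position = Struct.new(:base_uri, :format_version, :purl_type, :sequence_id, :chunk_id)

def parse_position(identifier)
  *base, format_version, purl_type, sequence_id, chunk_id = identifier.split('/')
  Position.new(base.join('/'), format_version, purl_type,
               sequence_id.to_i, chunk_id.to_i)
end
```

Because the split is anchored on the trailing segments, base_uri survives intact whether it is an https:// bucket URL or a file:// path.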

Sequence ID usage

The sequence-id is used both as a "cursor" and as a unique identifier. That is, the client must find its stored sequence-id in the bucket; otherwise it must assume that the sequence-id is zero and start from the beginning. Some example use cases follow.

Given the following structure:

root
- 5
  - 1668056400
    - 1.csv
    - 2.csv
    - 3.csv
  - 1668099600
    - 1.csv
    - 2.csv
  - 1668488400
    - 1.csv

Use case 1: sequence-id and chunk-id found

Client stored: <bucket>/5/1668099600/1.csv

Client finds both sequence-id (1668099600) and chunk-id (1.csv). Client should start at 1668099600/2.csv.

Use case 2: client had last chunk-id of a sequence

Client stored: <bucket>/5/1668099600/2.csv

Client finds both sequence-id (1668099600) and the last chunk-id (2.csv) in sequence. Client should start at 1668488400/1.csv.

Use case 3: client does not find sequence-id

Client stored: <bucket>/5/1668099700/1.csv

Client does not find sequence-id (1668099700). Client rewinds its "cursor" and starts at the first sequence for the purl_type in the bucket: 1668056400/1.csv.
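The three use cases above can be sketched as a single cursor-advance function. It assumes the bucket listing is available as a map of sequence_id to sorted chunk ids (mirroring the example structure); the function name and return shape are illustrative.

```ruby
# Given the bucket listing and the stored position, return the
# [sequence_id, chunk_id] to start reading from, or nil when caught up.
def next_position(sequences, stored_seq, stored_chunk)
  seq_ids = sequences.keys.sort

  # Use case 3: stored sequence-id not found, rewind to the very first chunk.
  unless sequences.key?(stored_seq)
    first = seq_ids.first
    return [first, sequences[first].min]
  end

  chunks = sequences[stored_seq].sort
  idx = chunks.index(stored_chunk)
  if idx && idx < chunks.size - 1
    # Use case 1: more chunks remain in the stored sequence.
    [stored_seq, chunks[idx + 1]]
  else
    # Use case 2: stored chunk was the last; move to the next sequence.
    next_seq = seq_ids.find { |s| s > stored_seq }
    next_seq ? [next_seq, sequences[next_seq].min] : nil
  end
end
```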

Offline functionality

  • create a rake task to package bucket data into a compressed file (using sequence id stored on instance)
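A minimal sketch of the packaging step that rake task might perform, assuming each chunk lives at <sequence_id>/<chunk_id>.csv on disk and the archive is a single gzipped CSV; the function name, file layout, and `since` filter are assumptions about the eventual task.

```ruby
require 'zlib'

# Concatenate chunk files newer than the stored sequence id into one
# gzipped CSV. Each path is expected to look like .../<sequence_id>/<n>.csv.
def package_for_offline(chunk_paths, output_path, since_sequence_id)
  Zlib::GzipWriter.open(output_path) do |gz|
    chunk_paths.each do |path|
      seq = File.basename(File.dirname(path)).to_i
      next unless seq > since_sequence_id

      gz.write(File.read(path))
    end
  end
end
```

The offline connector would then read the archive back through Zlib::GzipReader and feed the rows into the same sync pipeline as the GCP connector.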