[PROMOTED] Sync Rails backend with License DB

Problem to solve

GitLab instances need to stay in sync with the License Database, which is exported on a regular basis.

Proposal

Check exports of the License DB periodically, and insert new license data into the DB. See Export License DB (#373030 - closed).

For now, the instance's needs are met with the simplest possible method.

Method:

  1. use a GCP bucket to synchronize data for non-offline instances
  2. use CSV as the format, where each row describes the licenses for a particular package version
  3. use a unique identifier to denote position in sequence, version of format used, and package type being imported
    • callers can use the sequence identifier to skip data already stored, because new sequence identifiers are strictly increasing integers
  4. for offline mode, create a rake task to trigger synchronization against a local file on disk
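As an illustrative sketch of the format-v1 row described above (the column order and the `|` separator for multiple SPDX identifiers are assumptions, not the finalized export format), a row could be parsed like this:

```ruby
require "csv"

# Hypothetical row layout: package_name, version, spdx_identifiers.
# The pipe-delimited encoding of multiple licenses is an assumption.
Row = Struct.new(:package_name, :version, :spdx_identifiers)

def parse_row(line)
  name, version, licenses = CSV.parse_line(line)
  Row.new(name, version, licenses.split("|"))
end
```

A caller iterating a chunk file would invoke `parse_row` once per line to obtain the tuples the sync service consumes.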

Implementation plan

The plan has been broken into sub-issues by functionality:

(old) Implementation plan

This implementation plan has been superseded by the issues above. It has not been removed because there are discussions linking to it.

  • background worker
    • scheduled background job that triggers the sync
  • sync service
    • retrieve last sync position (see Store sync position section below)
    • use connector (see Connectors section below) and pass purl-type, sequence-id, chunk-id
    • using stream yielded by connector
      • iterate over [package_name, version, spdx_identifiers] tuples
      • convert these into batches (e.g. PackageMetadata::Batch) which implement BulkInsertableTask to update the package metadata models
    • save sync position
  • data import service (bulk insert)
    • save {package, version, license} tuples into the database
    • use BulkInsertableTask
    • provide the caller with a callback when the save has occurred
  • data connectors
    • gcp bucket Add GCP bucket connector for fetching package m... (#383797 - closed)
      • responsible for establishing a connection to the GCP bucket
      • can seek to sequence/chunk within bucket
      • opens a CSV stream and extracts tuples
    • offline storage (tbd)
      • responsible for opening compressed package metadata file
      • can seek to sequence/chunk within directory stored in file
      • opens a CSV stream and extracts tuples
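An illustrative sketch of the batching step above: the real PackageMetadata::DataBatch would include BulkInsertableTask and persist via insert_all!, while this stand-in collects flushed rows in memory so the accumulation logic can run on its own. The class name, method names, and max_size are assumptions.

```ruby
# Stand-in for PackageMetadata::DataBatch: buffers rows and "bulk inserts"
# (here: appends to an array) whenever the buffer fills.
class DataBatch
  attr_reader :flushed

  def initialize(purl_type, max_size: 1000)
    @purl_type = purl_type
    @max_size = max_size
    @rows = []
    @flushed = []
  end

  # Buffer one row; flush automatically once the batch is full.
  def add(sequence_id, name, version, spdx_identifiers)
    @rows << [@purl_type, sequence_id, name, version, spdx_identifiers]
    insert_all! if @rows.size >= @max_size
  end

  # Stand-in for the bulk insert performed by BulkInsertableTask.
  def insert_all!
    @flushed.concat(@rows)
    @rows = []
  end
end
```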

Connectors

Synchronization will require two types of connections: one to the GCP bucket, and one to a "local" file when the instance is offline. The connectors can therefore expose the same API to callers while implementing different connection types.

Note: format version 1 will use CSV as the data format, so the connector should be able to parse CSV rows and provide a stream of well-formed tuples to the caller.
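As a hedged sketch of that shared API, here is an in-memory connector that responds to the same two messages (position_after and slice_of); the real connectors would stream CSV chunks from the bucket or a local file instead of an array:

```ruby
# In-memory stand-in for a connector. rows is an array of
# [sequence_id, chunk_id, tuple] entries, kept in position order.
class InMemoryConnector
  def initialize(rows)
    @rows = rows.sort_by { |seq, chunk, _| [seq, chunk] }
  end

  # Seek past the stored position; only later rows remain to be yielded.
  def position_after(sequence_id, chunk_id)
    @rows = @rows.drop_while do |seq, chunk, _|
      ([seq, chunk] <=> [sequence_id, chunk_id]) <= 0
    end
    self
  end

  # Yield the remaining tuples to the caller, slice_size rows at a time.
  def slice_of(slice_size)
    @rows.each_slice(slice_size) do |slice|
      slice.each { |seq, _chunk, tuple| yield seq, tuple }
    end
  end
end
```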

Pseudocode illustrating how a caller may use the connectors:

module PackageMetadata
  class SyncService
    BATCH_SIZE = 1000 # illustrative

    def sync(sync_uri)
      position = SyncPosition.find(sync_uri)
      batch = PackageMetadata::DataBatch.new(position.purl_type)
      num_tuples_consumed = 0

      connector_for(position.base_uri, position.format_version, position.purl_type)
        .position_after(position.sequence_id, position.chunk_id)
        .slice_of(slice_size) do |sequence_id, tuple|
          if num_tuples_consumed >= BATCH_SIZE
            batch.insert_all!
            batch = PackageMetadata::DataBatch.new(position.purl_type)
            num_tuples_consumed = 0
          end

          batch.add(sequence_id, tuple.package_name, tuple.package_version, tuple.spdx_identifier)
          num_tuples_consumed += 1
        end

      batch.insert_all! # flush the final partial batch
    end
  end
end

Store sync position

  • introduce PackageMetadata::SyncPosition model with a backing table
    • store a tuple of [purl_type, position identifier, timestamp]
    • position identifier will have the structure: <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>
      • base_uri points to either a file:// or https:// location
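A hedged sketch of splitting that position identifier into its parts. It assumes the last four path segments are always format_version, purl_type, sequence_id, and chunk_id, with everything before them forming base_uri; the struct and helper names are illustrative, not the model's API.

```ruby
# Decompose <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>.
Position = Struct.new(:base_uri, :format_version, :purl_type, :sequence_id, :chunk_id)

def parse_position(identifier)
  *base, format_version, purl_type, sequence_id, chunk_id = identifier.split('/')
  Position.new(base.join('/'), format_version, purl_type,
               sequence_id.to_i, chunk_id.to_i)
end
```

Because the split is anchored on the trailing segments, base_uri survives intact whether it is an https:// bucket URL or a file:// path.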

Sequence ID usage

The sequence-id is used both as a "cursor" and as a unique identifier. That is, the client must find its stored sequence-id in the bucket; otherwise it must assume that the sequence-id is zero and start from the beginning. Some example use cases follow.

Given the following structure:

root
- 5
  - 1668056400
    - 1.csv
    - 2.csv
    - 3.csv
  - 1668099600
    - 1.csv
    - 2.csv
  - 1668488400
    - 1.csv

Use case 1: sequence-id and chunk-id found

Client stored: <bucket>/5/1668099600/1.csv

Client finds both sequence-id (1668099600) and chunk-id (1.csv). Client should start at 1668099600/2.csv.

Use case 2: client had last chunk-id of a sequence

Client stored: <bucket>/5/1668099600/2.csv

Client finds both sequence-id (1668099600) and the last chunk-id (2.csv) in sequence. Client should start at 1668488400/1.csv.

Use case 3: client does not find sequence-id

Client stored: <bucket>/5/1668099700/1.csv

Client does not find sequence-id (1668099700). Client rewinds its "cursor" and starts at the first sequence for the purl_type in the bucket: 1668056400/1.csv.
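The three use cases above can be sketched as a single cursor-advance function. It assumes the bucket listing is available as a map of sequence_id to sorted chunk ids (mirroring the example structure); the function name and return shape are illustrative.

```ruby
# Given the bucket listing and the stored position, return the
# [sequence_id, chunk_id] to start reading from, or nil when caught up.
def next_position(sequences, stored_seq, stored_chunk)
  seq_ids = sequences.keys.sort

  # Use case 3: stored sequence-id not found, rewind to the very first chunk.
  unless sequences.key?(stored_seq)
    first = seq_ids.first
    return [first, sequences[first].min]
  end

  chunks = sequences[stored_seq].sort
  idx = chunks.index(stored_chunk)
  if idx && idx < chunks.size - 1
    # Use case 1: more chunks remain in the stored sequence.
    [stored_seq, chunks[idx + 1]]
  else
    # Use case 2: stored chunk was the last; move to the next sequence.
    next_seq = seq_ids.find { |s| s > stored_seq }
    next_seq ? [next_seq, sequences[next_seq].min] : nil
  end
end
```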

Offline functionality

  • create a rake task to package bucket data into a compressed file (using sequence id stored on instance)
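A minimal sketch of the packaging step that rake task might perform, assuming each chunk lives at <sequence_id>/<chunk_id>.csv on disk and the archive is a single gzipped CSV; the function name, file layout, and `since` filter are assumptions about the eventual task.

```ruby
require 'zlib'

# Concatenate chunk files newer than the stored sequence id into one
# gzipped CSV. Each path is expected to look like .../<sequence_id>/<n>.csv.
def package_for_offline(chunk_paths, output_path, since_sequence_id)
  Zlib::GzipWriter.open(output_path) do |gz|
    chunk_paths.each do |path|
      seq = File.basename(File.dirname(path)).to_i
      next unless seq > since_sequence_id

      gz.write(File.read(path))
    end
  end
end
```

The offline connector would then read the archive back through Zlib::GzipReader and feed the rows into the same sync pipeline as the GCP connector.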