Add service for syncing package metadata with external license db

Problem to solve

The external license database provides the instance with license data. This is stored in object storage (public bucket or local file). The instance needs to import this data into its database.

Proposal

Add a package metadata sync service to import external license db data.

Because of the amount of data stored in the data source, the service should keep track of the last synced position so that it doesn't have to re-import all the data in the bucket on each invocation.

Using the last sync position, the service will open a connection to the correct object/file in the data source (using a dedicated connector) and stream the CSV rows.

Once the CSV stream is open, the service will iterate over the [package_name, version, spdx_identifiers] tuples in slices (e.g. 100 tuples at a time) and save them to the database using PackageMetadata::ImportService. The database has a unique constraint on the tuple's columns, so duplicate data is not added and the service does not have to handle duplicates itself.
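The slicing step can be sketched with Ruby's stdlib CSV: `CSV.foreach` without a block returns an enumerator that streams rows without loading the whole file, and `each_slice` groups them into batches. The file contents and the slice size of 100 are illustrative, mirroring the example in the proposal.

```ruby
require "csv"
require "tempfile"

# Write a small illustrative CSV of [package_name, version, spdx_identifiers] rows.
file = Tempfile.new(["licenses", ".csv"])
(1..250).each { |i| file.puts "pkg-#{i},1.0.#{i},MIT" }
file.flush

slices = []
# CSV.foreach streams rows lazily; each_slice batches them for bulk import.
CSV.foreach(file.path).each_slice(100) do |tuples|
  slices << tuples.size # a real implementation would hand tuples to the import service here
end

puts slices.inspect # batch sizes, each at most 100
file.close!
```

Streaming plus slicing keeps memory usage bounded regardless of how large the source object is.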

Once finished, the service will store the new last sync position.
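The resume behavior can be illustrated with a minimal sketch. The `CsvFile` struct and the checkpoint hash are hypothetical stand-ins for the connector's file listing and the stored sync position; only files strictly after the checkpoint's (sequence, chunk) pair are fetched again.

```ruby
# Hypothetical file listing; sequence/chunk ids mirror the checkpoint fields.
CsvFile = Struct.new(:sequence, :chunk)

files = [
  CsvFile.new(1, 0), CsvFile.new(1, 1),
  CsvFile.new(2, 0), CsvFile.new(2, 1),
]

checkpoint = { sequence: 1, chunk: 1 } # last successfully synced position

# Array#<=> compares element-wise, so (sequence, chunk) pairs order correctly.
pending = files.select do |f|
  ([f.sequence, f.chunk] <=> [checkpoint[:sequence], checkpoint[:chunk]]) > 0
end

pending.each { |f| puts "sync #{f.sequence}/#{f.chunk}" } # only files past the checkpoint
```

Because the position is stored only after a file is imported, a crash at worst replays one file, which the duplicate-skipping schema tolerates.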

Implementation Plan

  • add PackageMetadata::SyncService::Settings under ee/app/services
    • provides data on the base_uri, supported data formats and purl_types
    • for gcp the base_uri will be the bucket name
    • for offline the base_uri will likely be a path in a filesystem
  • add PackageMetadata::SyncService under ee/app/services
    • iterates over all purl_types supported by the instance
    • use PackageMetadata::Connector to retrieve the connector for a service defined by [base_uri, version_format, purl_type] (two connectors are currently defined: gcp and offline)
    • retrieve the last sync position by finding the PackageMetadata::Checkpoint for the connection URI defined by base_uri/version_format/purl_type
    • invoke the connector's #data_after method to fetch the data after the last sync position using sequence_id and chunk_id
    • invoke PackageMetadata::ImportService to store slices of 3-tuples of the form [package, version, license] yielded by the connector
    • if the sequence_id or chunk_id advanced and the new data was stored successfully, store a new sync position
  • add PackageMetadata::Checkpoint to store last position in data store

Pseudocode illustrating SyncService points above:

module PackageMetadata
  class SyncService
    def execute
      settings = PackageMetadata::SyncService::Settings
      base_uri = settings.base_uri
      data_format_version = settings.data_format_version
      purl_types = settings.supported_purl_types

      purl_types.each do |purl_type|
        checkpoint = PackageMetadata::Checkpoint.for_format_and_purl_type(data_format_version, purl_type)
        connector_for(base_uri, data_format_version, purl_type)
          .data_after(checkpoint)
          .each do |file|
            # import in slices; the database's unique constraint ignores duplicates
            file.each_slice(100) do |data_objects|
              PackageMetadata::ImportService.execute(data_objects)
            end
            # persist the new sync position only after the file is fully imported
            checkpoint.update(sequence_id: file.sequence, chunk_id: file.chunk)
          end
      end
    end
  end
end

Idempotency and sync position storage

The database schema is structured to skip duplicates, so if an error occurs and the most recent sync position is not saved, restarting from the previous sync position will not cause data corruption: the replayed rows are simply ignored.
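The idempotency argument can be simulated with an in-memory store. In the real service the guarantee comes from a database unique index on the tuple's columns; here a `Set` plays that role, since `Set#add` silently skips values already present.

```ruby
require "set"

# Illustrative in-memory store; the real uniqueness guarantee comes from
# a database unique constraint on (package_name, version, license).
store = Set.new

import = ->(tuples) { tuples.each { |t| store.add(t) } } # Set#add skips duplicates

batch = [["pkg-a", "1.0.0", "MIT"], ["pkg-b", "2.1.0", "Apache-2.0"]]

import.call(batch) # first run imports both rows
import.call(batch) # a retry after a lost checkpoint replays the same batch safely

puts store.size # still 2: replaying the batch did not duplicate data
```

This is why the service can restart from a stale checkpoint without any cleanup step: re-importing is a no-op for rows already present.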

Edited by Igor Frenkel