[PROMOTED] Sync Rails backend with License DB
Problem to solve
GitLab instances need to stay in sync with the License Database, which is exported on a regular basis.
Proposal
Check exports of the License DB periodically, and insert new license data into the DB. See Export License DB (#373030 - closed).
Currently, the needs of the instance are met with the simplest possible method:
- use a GCP bucket to synchronize data for non-offline instances
- use `csv` as the format, where each row describes the licenses for a particular package version (an illustrative row is parsed below)
- use a unique identifier to denote the position in the sequence, the version of the format used, and the package type being imported
- the caller can use the sequence identifier to skip data already stored, because new sequence identifiers are strictly increasing integers
- for offline instances, create a rake task to trigger synchronization against a local file on disk
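For illustration, here is how a hypothetical v1 row could be parsed. The exact column layout is defined by the export format, so the three-column shape below is an assumption:

```ruby
require 'csv'

# Hypothetical row layout: package_name, version, comma-separated SPDX identifiers.
row = CSV.parse_line('lodash,4.17.21,"MIT,Apache-2.0"')

package_name = row[0]                # => "lodash"
version = row[1]                     # => "4.17.21"
spdx_identifiers = row[2].split(',') # => ["MIT", "Apache-2.0"]
```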
Implementation plan
The plan has been broken into sub issues by functionality:
- background worker: Add scheduled sync background worker for packag... (#383719 - closed)
- sync service: Add service for syncing package metadata with e... (#383722 - closed)
- data import service: Add service to import package metadata into the DB (#383723 - closed)
- gcp storage connector: Add GCP bucket connector for fetching package m... (#383797 - closed)
- offline storage connector (pending)
(old) Implementation plan
This implementation plan has been superseded by the issues above. It has not been removed because discussions link to it.
- background worker: scheduled background job that triggers the sync
- sync service
  - retrieve the last sync position (see the Store sync position section below)
  - use a connector (see the Connectors section below) and pass `purl-type`, `sequence-id`, `chunk-id`
  - using the stream yielded by the connector:
    - iterate over `[package_name, version, spdx_identifiers]` tuples
    - convert these into batches (e.g. `PackageMetadata::Batch`) which implement `BulkInsertableTask` to update the package metadata models
  - save the sync position
- data import service (bulk insert; a sketch of the batch abstraction follows this list)
  - save `{package, version, license}` tuples into the database
  - use `BulkInsertableTask`
  - provide the caller with a callback when the save occurs
- data connectors
  - gcp bucket Add GCP bucket connector for fetching package m... (#383797 - closed)
    - responsible for establishing a connection to the GCP bucket
    - can seek to a sequence/chunk within the bucket
    - opens a CSV stream and extracts tuples
  - offline storage (tbd)
    - responsible for opening the compressed package metadata file
    - can seek to a sequence/chunk within the directory stored in the file
    - opens a CSV stream and extracts tuples
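For illustration, a minimal sketch of the batch abstraction named above. `PackageMetadata::DataBatch`, the model name, and the column names are assumptions, and Rails' `insert_all` stands in for `BulkInsertableTask`, whose API is not shown here:

```ruby
module PackageMetadata
  # Accumulates [package_name, version, spdx_identifier] tuples and writes
  # them to the database in a single bulk insert.
  class DataBatch
    def initialize(purl_type)
      @purl_type = purl_type
      @rows = []
    end

    def add(sequence_id, package_name, package_version, spdx_identifier)
      @rows << {
        purl_type: @purl_type,
        sequence_id: sequence_id,
        name: package_name,
        version: package_version,
        license: spdx_identifier
      }
    end

    def size
      @rows.size
    end

    # Rails' insert_all used as a stand-in for the bulk insert mechanism;
    # PackageMetadata::PackageVersionLicense is a hypothetical model name.
    def insert_all!
      PackageMetadata::PackageVersionLicense.insert_all(@rows) if @rows.any?
      @rows = []
    end
  end
end
```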
Connectors
Synchronization will require two types of connections: to the GCP bucket, and to a "local" file when the instance is offline. The connectors can therefore expose the same API to callers while implementing different connection types.
Note: format version 1 will use CSV as the data format, so the connector should be able to parse the CSV rows and provide a stream of well-formed tuples to the caller (a connector-side sketch follows the pseudocode below).
Pseudocode to illustrate how a caller may use the connectors:

```ruby
module PackageMetadata
  class SyncService
    def sync(sync_uri)
      position = SyncPosition.find(sync_uri)
      batch = PackageMetadata::DataBatch.new(position.purl_type)
      num_tuples_consumed = 0

      connector_for(position.base_uri, position.format_version, position.purl_type)
        .position_after(position.sequence_id, position.chunk_id)
        .slice_of(slice_size) do |sequence_id, tuple|
          # flush the batch once it reaches the configured size
          if num_tuples_consumed >= batch_size
            batch.insert_all!
            batch = PackageMetadata::DataBatch.new(position.purl_type)
            num_tuples_consumed = 0
          end

          batch.add(sequence_id, tuple.package_name, tuple.package_version, tuple.spdx_identifier)
          num_tuples_consumed += 1
        end

      # insert the final, partially filled batch
      batch.insert_all! if num_tuples_consumed > 0
    end
  end
end
```
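On the connector side, both implementations could satisfy the same small interface. A sketch, assuming the connector yields parsed CSV rows as tuples; class and method names are illustrative, not the final API:

```ruby
module PackageMetadata
  # Common shape both connectors (GCP bucket and offline file) could expose.
  class BaseConnector
    Tuple = Struct.new(:package_name, :package_version, :spdx_identifier)

    # Remembers the last stored position and returns self so calls can be
    # chained, as in the sync service pseudocode above.
    def position_after(sequence_id, chunk_id)
      @sequence_id = sequence_id
      @chunk_id = chunk_id
      self
    end

    # Yields [sequence_id, Tuple] pairs for every CSV row after the stored
    # position. slice_size bounds how much data is fetched per read; rows are
    # still yielded one at a time, matching the caller pseudocode.
    def slice_of(slice_size)
      each_csv_row_after(@sequence_id, @chunk_id, slice_size) do |sequence_id, row|
        yield sequence_id, Tuple.new(row[0], row[1], row[2])
      end
    end

    private

    # Subclasses enumerate CSV rows from their backing store: the GCP bucket
    # for online instances, or a file on disk for offline instances.
    def each_csv_row_after(sequence_id, chunk_id, slice_size)
      raise NotImplementedError
    end
  end
end
```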
Store sync position
- introduce a `PackageMetadata::SyncPosition` model with a backing table
- store a tuple of `[purl_type, position identifier, timestamp]`
- the position identifier will have the structure `<base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>`
  - `base_uri` points to either a `file://` or an `https://` location (a model sketch follows this list)
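A minimal sketch of how the model could expose the components of the stored identifier, assuming it is persisted in a single string column (`position_identifier` and the table name are assumptions):

```ruby
module PackageMetadata
  class SyncPosition < ApplicationRecord
    self.table_name = 'package_metadata_sync_positions'

    # position_identifier has the form
    # <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>,
    # so the trailing components are read from the right, leaving everything
    # before them as the base_uri (which itself contains slashes).
    def chunk_id
      position_identifier.split('/')[-1]
    end

    def sequence_id
      position_identifier.split('/')[-2].to_i
    end

    def purl_type
      position_identifier.split('/')[-3]
    end

    def format_version
      position_identifier.split('/')[-4].to_i
    end

    def base_uri
      position_identifier.split('/')[0..-5].join('/')
    end
  end
end
```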
Sequence ID usage
The sequence-id is used both as a "cursor" and as a unique identifier. The client must find its stored sequence-id in the bucket; if it cannot, it must assume a sequence-id of zero and start from the beginning. Some example use cases follow.
Given the following structure:
```
root
- 5
  - 1668056400
    - 1.csv
    - 2.csv
    - 3.csv
  - 1668099600
    - 1.csv
    - 2.csv
  - 1668488400
    - 1.csv
```
Use case 1: sequence-id and chunk-id found
Client stored: <bucket>/5/1668099600/1.csv
Client finds both sequence-id (1668099600) and chunk-id (1.csv). Client should start at 1668099600/2.csv.
Use case 2: client had last chunk-id of a sequence
Client stored: <bucket>/5/1668099600/2.csv
Client finds both sequence-id (1668099600) and the last chunk-id (2.csv) in sequence. Client should start at 1668488400/1.csv.
Use case 3: client does not find sequence-id
Client stored: <bucket>/5/1668099700/1.csv
Client does not find sequence-id (1668099700). Client rewinds its "cursor" and starts at the first sequence for the purl_type in the bucket: 1668056400 and 1.csv.
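All three cases reduce to a single lookup rule. A sketch, assuming `sequences` is an ascending list of `[sequence_id, [chunk_ids]]` pairs obtained by listing the bucket (the helper name is illustrative):

```ruby
# Returns the [sequence_id, chunk_id] to start reading from, given the stored
# position, or nil when the client is fully caught up.
def next_position(sequences, stored_sequence_id, stored_chunk_id)
  index = sequences.index { |seq_id, _| seq_id == stored_sequence_id }

  # Use case 3: stored sequence-id not found - rewind to the very beginning.
  return [sequences.first[0], sequences.first[1].first] if index.nil?

  seq_id, chunks = sequences[index]
  chunk_index = chunks.index(stored_chunk_id)

  if chunk_index && chunk_index < chunks.size - 1
    # Use case 1: resume at the next chunk in the same sequence.
    [seq_id, chunks[chunk_index + 1]]
  elsif index < sequences.size - 1
    # Use case 2: stored chunk was the last in its sequence - move on.
    [sequences[index + 1][0], sequences[index + 1][1].first]
  else
    nil
  end
end
```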
Offline functionality
- create a rake task to package bucket data into a compressed file, using the sequence id stored on the instance (a sketch follows)
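A sketch of what that rake task could look like, assuming the bucket data has already been mirrored to a local directory and the archive format is a gzipped tar; the task name, paths, and `SyncPosition` lookup are illustrative:

```ruby
# lib/tasks/package_metadata.rake (illustrative path and task name)
namespace :package_metadata do
  desc 'Package bucket data newer than the stored sync position into an archive'
  task :package_offline_export, [:source_dir, :out_file] => :environment do |_t, args|
    require 'rubygems/package'
    require 'zlib'

    # The stored sequence id lives on the instance's sync position record.
    position = PackageMetadata::SyncPosition.last

    File.open(args[:out_file], 'wb') do |file|
      Zlib::GzipWriter.wrap(file) do |gz|
        Gem::Package::TarWriter.new(gz) do |tar|
          Dir.glob(File.join(args[:source_dir], '**', '*.csv')).sort.each do |path|
            # The parent directory name is the sequence id; skip chunks the
            # instance has already consumed.
            sequence_id = File.basename(File.dirname(path)).to_i
            next if position && sequence_id <= position.sequence_id

            data = File.read(path)
            tar.add_file_simple(path, 0o644, data.bytesize) { |io| io.write(data) }
          end
        end
      end
    end
  end
end
```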