(spike) sync between monolith package metadata and the license database

Problem to solve

The GitLab instance stores package metadata used to find the licenses of dependencies. The external license database imports and stores license data from supported package registries.

A synchronization system is required to get data from the external database into the GitLab instance.

This issue is meant for the discussion and testing of this system. The output of this issue should be a decision on the considerations listed below.

Considerations

Several considerations need to be discussed in order to create a robust synchronization system.

Communication protocol

How will the instance and the external database communicate to transmit license data?

Configuration & Discovery

Depending on the protocol chosen, one or both of the components will need configuration for discovering the other component.

For example, a solution using object storage (i.e. the external database writes its data to a GCP bucket and the instance pulls the data down) needs some way to configure bucket parameters, authentication, etc.

Initial Seed vs Changes

Seeding: new instances (or those adding the package metadata feature for the first time) will need to seed the database with all the data currently stored in the external database.

Changes: the instance will need to periodically pull changes from the external database (e.g. new versions).
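
As a rough illustration, seeding and incremental updates could share one mechanism if every export carries a strictly increasing sequence ID (as proposed in the Outcome section). The sketch below is only an assumption about how an instance might pick what to import; the function name `pending_sequences` and the idea of a stored `last_synced` cursor are hypothetical and not part of any decision in this issue.

```python
from typing import Iterable, List, Optional


def pending_sequences(available: Iterable[int],
                      last_synced: Optional[int]) -> List[int]:
    """Return the sequence IDs still to import, oldest first.

    `available` is the set of sequence IDs published by the external
    database (hypothetical input; how they are listed is not decided here).
    `last_synced` is None for an instance that has never synced, which
    turns the call into an initial seed.
    """
    ordered = sorted(set(available))
    if last_synced is None:
        # Initial seed: import everything that has been published.
        return ordered
    # Incremental update: import only exports newer than the cursor.
    return [seq for seq in ordered if seq > last_synced]
```

With this shape, a new instance and an up-to-date instance run the same code path; only the stored cursor differs.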

Data format

CSV has been discussed, but this may change based on the considerations above.

There's also the question of data completeness: is every new version of a package considered a change by the external database or will only changes in license be emitted?

Offline mode

Instances in offline mode may need extra operational or configuration steps. For example, will the external database be co-released with (or released as part of) GitLab and thus be an always-available service?

Outcome

This is a summary of the decisions on the issue's topics (see the Discoto section for links).

  • Communication protocol: Use public GCP bucket over HTTPS.
  • Configuration & discovery: Use a single public bucket with a well-known URL, without configuration.
  • Initial seed vs changes: Use the same sync protocol to seed and update the database.
  • Data format:
    • Use CSV files in a directory-like layout (one subdirectory per PURL type) with strictly increasing sequence IDs (e.g. timestamps), chunked by size.
    • Layout of GCP buckets: <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>
    • TBD: Do we add the .csv extension to the path in the GCP bucket?
    • The sequence ID is the Unix timestamp of the export, in UTC timezone (integer). Example: 1668056400
    • The chunk ID is an integer. The first chunk within a sequence is 1.
    • The format version is an integer. The initial format version is 1.
    • Fields of CSV files, in that specific order:
      1. package name
      2. package version
      3. comma-separated list of SPDX identifiers of licenses
  • Offline mode:
    • Use a vendored copy of the data in the GCP bucket.
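
To make the data format above concrete, here is a minimal sketch of how a consumer might build an object path and parse one chunk. It assumes that the license list is quoted when it contains more than one SPDX identifier (so the CSV still has exactly three fields) and that chunks have no header row; neither point is stated in this issue. The names `object_path`, `parse_chunk`, and `PackageLicenses`, and the example base URI, are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple

FORMAT_VERSION = 1  # initial format version, per the list above


class PackageLicenses(NamedTuple):
    name: str
    version: str
    licenses: List[str]  # SPDX identifiers


def object_path(base_uri: str, purl_type: str, sequence_id: int,
                chunk_id: int, format_version: int = FORMAT_VERSION) -> str:
    """Build <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>."""
    base = base_uri.rstrip("/")
    return f"{base}/{format_version}/{purl_type}/{sequence_id}/{chunk_id}"


def parse_chunk(text: str) -> List[PackageLicenses]:
    """Parse one CSV chunk: package name, package version, license list.

    Assumes no header row and a quoted third field when it holds
    several comma-separated SPDX identifiers.
    """
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        name, version, licenses = record
        spdx_ids = [s.strip() for s in licenses.split(",") if s.strip()]
        rows.append(PackageLicenses(name, version, spdx_ids))
    return rows
```

For example, `object_path("https://example.com/bucket", "npm", 1668056400, 1)` yields `https://example.com/bucket/1/npm/1668056400/1`. Whether a `.csv` extension is appended is still TBD above, so the sketch leaves it out.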

/cc @brytannia @fcatteau

Auto-Summary 🤖

Discoto Usage

Points

Discussion points are declared by headings, list items, and single lines that start with the text (case-insensitive) point:. For example, the following are all valid points:

  • #### POINT: This is a point
  • * point: This is a point
  • + Point: This is a point
  • - pOINT: This is a point
  • point: This is a **point**

Note that any markdown used in the point text will also be propagated into the topic summaries.

Topics

Topics can be stand-alone and contained within an issuable (epic, issue, MR), or can be inline.

Inline topics are defined by creating a new thread (discussion) where the first line of the first comment is a heading that starts with (case-insensitive) topic:. For example, the following are all valid topics:

  • # Topic: Inline discussion topic 1
  • ## TOPIC: **{+A Green, bolded topic+}**
  • ### tOpIc: Another topic

Quick Actions

| Action | Description |
|---|---|
| /discuss sub-topic TITLE | Create an issue for a sub-topic. Does not work in epics |
| /discuss link ISSUABLE-LINK | Link an issuable as a child of this discussion |

Last updated by this job

Discoto Settings
---
summary:
  max_items: -1
  sort_by: created
  sort_direction: ascending

See the settings schema for details.

Edited by Fabien Catteau