(spike) sync between monolith package metadata and the license database
Problem to solve
The GitLab instance stores package metadata used to find the licenses of dependencies. The [external license database] imports and stores license data from supported package registries.
A synchronization system is required to get data from the external database into the GitLab instance.
This issue is meant for the discussion and testing of this system. The output of this issue should be a decision on the considerations listed below.
Considerations
There are several considerations to discuss in order to create a robust synchronization.
Communication protocol
How the instance and external database will communicate to transmit license data.
Configuration & Discovery
Depending on the protocol chosen, one or both of the components will need configuration to discover the other component.
For example, a solution using object storage (i.e. the external database writes its data to a GCP bucket and the instance pulls the data down) needs some way to configure bucket parameters, authentication, etc.
Initial Seed vs Changes
Seeding: new instances (or those adding the package metadata feature for the first time) will need to seed the database with all the data currently stored in the external database.
Changes: the instance will need to periodically pull changes from the external database (e.g. new versions).
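Both cases can be served by one mechanism if the instance remembers the last export it imported. The following is a minimal sketch, assuming exports are identified by increasing integer sequence IDs; `pending_sequences` and the checkpoint value are hypothetical names used for illustration, not part of any existing implementation.

```python
from typing import Iterable, List, Optional


def pending_sequences(available: Iterable[int], checkpoint: Optional[int]) -> List[int]:
    """Return the sequence IDs still to import, oldest first.

    checkpoint is the last sequence ID already imported, or None for an
    instance that has never synced (the seeding case).
    """
    if checkpoint is None:
        # Seeding: nothing imported yet, so every export is pending.
        return sorted(available)
    # Changes: only exports newer than the checkpoint are pending.
    return sorted(seq for seq in available if seq > checkpoint)


exports = [1668056400, 1668142800, 1668229200]
print(pending_sequences(exports, None))        # seeding
print(pending_sequences(exports, 1668142800))  # incremental update
```

With this shape, seeding is simply an update that starts from an empty checkpoint, so no separate seeding path is required.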
Data format
CSV has been discussed, but this might change based on the considerations above.
There's also the question of data completeness: is every new version of a package considered a change by the external database or will only changes in license be emitted?
Offline mode
Instances in offline mode may need extra operational or configuration steps. For example, will the external database be co-released with (or released as part of) GitLab and thus be an always-available service?
Outcome
This is a summary of the decisions on the issue's topics (see discoto section for links).
- Communication protocol: Use public GCP bucket over HTTPS.
- Configuration & discovery: Use a single public bucket with a well-known URL, without configuration.
- Initial seed vs changes: Use the same sync protocol to seed and update the database.
- Data format:
  - Use CSV files in a directory-like layout (one subdir per PURL type) with strictly increasing sequence IDs (e.g. timestamps) and chunked by size.
  - Layout of GCP buckets: <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>
    - TBD: Do we add the .csv extension to the path of the GCP bucket?
    - The sequence ID is the Unix timestamp of the export, in UTC timezone (integer). Example: 1668056400
    - The chunk ID is an integer. The first chunk within a sequence is 1.
    - The format version is an integer. The initial format version is 1.
  - Fields of CSV files, in that specific order:
    - package name
    - package version
    - comma-separated list of SPDX identifiers of licenses
- Offline mode:
  - vendored copy of the data in the GCP bucket
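The bucket layout and CSV fields above can be sketched as follows. This is an illustration under stated assumptions, not the actual importer: the function and type names are hypothetical, the base URI is a placeholder, no .csv extension is appended (that point is still TBD), and the license list is assumed to be quoted whenever it holds more than one identifier so that its commas survive CSV parsing.

```python
import csv
import io
from typing import List, NamedTuple


class PackageLicenses(NamedTuple):
    name: str
    version: str
    licenses: List[str]


def chunk_path(base_uri: str, format_version: int, purl_type: str,
               sequence_id: int, chunk_id: int) -> str:
    """Build <base_uri>/<format_version>/<purl_type>/<sequence_id>/<chunk_id>."""
    base = base_uri.rstrip("/")
    return f"{base}/{format_version}/{purl_type}/{sequence_id}/{chunk_id}"


def parse_chunk(text: str) -> List[PackageLicenses]:
    """Parse rows of (package name, package version, license list)."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        name, version, licenses = row
        # The third field is itself a comma-separated list of SPDX identifiers.
        ids = [lic.strip() for lic in licenses.split(",") if lic.strip()]
        rows.append(PackageLicenses(name, version, ids))
    return rows


# Placeholder base URI; the real bucket URL is not given here.
print(chunk_path("https://example.com/bucket", 1, "npm", 1668056400, 1))

sample = 'lodash,4.17.21,MIT\nexample-pkg,1.0.0,"MIT,Apache-2.0"\n'
for entry in parse_chunk(sample):
    print(entry)
```

Because the third field reuses the comma as its own separator, a producer that does not quote multi-license values would yield rows with more than three columns; the sketch would then fail on unpacking rather than silently misread the row.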
/cc @brytannia @fcatteau
Auto-Summary 🤖
Discoto Usage
Points
Discussion points are declared by headings, list items, and single lines that start with the text (case-insensitive) point:. For example, the following are all valid points:
#### POINT: This is a point
* point: This is a point
+ Point: This is a point
- pOINT: This is a point
point: This is a **point**
Note that any markdown used in the point text will also be propagated into the topic summaries.
Topics
Topics can be stand-alone and contained within an issuable (epic, issue, MR), or can be inline.
Inline topics are defined by creating a new thread (discussion) where the first line of the first comment is a heading that starts with (case-insensitive) topic:. For example, the following are all valid topics:
# Topic: Inline discussion topic 1
## TOPIC: **{+A Green, bolded topic+}**
### tOpIc: Another topic
Quick Actions
- /discuss sub-topic TITLE: Create an issue for a sub-topic. Does not work in epics.
- /discuss link ISSUABLE-LINK: Link an issuable as a child of this discussion.
Last updated by this job
- TOPIC Communication protocol #379137 (comment 1151668912)
- TOPIC Configuration & Discovery #379137 (comment 1151671830)
- TOPIC Initial seed vs changes #379137 (comment 1151674945)
- TOPIC One file per package type? #379137 (comment 1157883478)
- TOPIC Idempotent import #379137 (comment 1159223166)
- TOPIC Yearly, weekly, daily updates #379137 (comment 1159248296)
  - Don't need this #379137 (comment 1161942227)
- TOPIC License name #379137 (comment 1160067858)
  - From License DB #379137 (comment 1160067858)
  - From SPDX License List #379137 (comment 1160067858)
  - Why do we need this? #379137 (comment 1160067858)
  - License policies #379137 (comment 1160082741)
  - Don't need this. #379137 (comment 1160091159)
- TOPIC Metadata #379137 (comment 1161911425)
- TOPIC Conclusion #379137 (comment 1174780636)
Discoto Settings
---
summary:
  max_items: -1
  sort_by: created
  sort_direction: ascending
See the settings schema for details.