Skip to content

Update DB schema to store data imported from the License DB

Problem to solve

License data imported from the License Database must be directly available to the Rails backend, so that it can use it to perform License Scanning as soon as CycloneDX SBOMs are ingested.

Proposal

Create a relation table sbom_component_version_licenses between sbom_component_versions and software_licenses to store license data imported from the License DB, and track licenses under which component versions have been published.

Since a component version can't have the same license multiple times, there's a unique index on (sbom_component_versions.id, software_licenses.id), and it's used as a primary composite key; this avoids having an id column, and saves space.

It's assumed that license data imported from the License DB is immutable, and we can avoid the updated_at column to save space. The created_at column can be used to know when specific license data was added to the primary DB.

The proposed schema supports composite licenses, but it does not support SPDX licenses expression. In particular, it can't make the distinction between a conjunctive set and a disjunctive set.

The following is the SQL code corresponding to the proposal:

CREATE TABLE sbom_component_version_licenses (
    component_version_id bigint REFERENCES sbom_component_version_licenses(id) ON DELETE CASCADE,
    software_license_id integer REFERENCES software_licenses(id) ON DELETE CASCADE,
    created_at timestamp with time zone NOT NULL,
    PRIMARY KEY (component_version_id, software_license_id)
)

(Adding a primary key will automatically create a unique B-tree index on the column or group of columns listed in the primary key, and will force the column(s) to be marked NOT NULL.)

The expected size for a DB record is 8 + 4 + = 16 bytes.

Proposal 2 (accepted)

Create a set of tables representing metadata from public package registries. It'll be imported from the external database service and be read only from the monolith-side components which can then conduct queries against these pm_ tables on fields like package names and versions.

CREATE TABLE pm_packages (
    id bigint NOT NULL,
    purl_type smallint NOT NULL,
    name text NOT NULL,
    CONSTRAINT check_packages_name_max_length CHECK ((char_length(name) <= 255))
);

CREATE TABLE pm_package_versions (  
    id bigint NOT NULL,                                                                                 
    package_id bigint NOT NULL REFERENCES pm.packages(id) ON DELETE CASCADE,                                                                       
    version text NOT NULL,                                                                                            
    CONSTRAINT check_package_versions_version_max_length CHECK ((char_length(version) <= 255))                               
); 

CREATE TABLE pm_licenses (
    id integer NOT NULL,
    spdx_identifier character varying(255) NOT NULL
);

CREATE TABLE pm_package_version_licenses (
    package_version_id bigint REFERENCES pm.package_versions(id) ON DELETE CASCADE,
    license_id integer REFERENCES pm.licenses(id) ON DELETE CASCADE,
    created_at timestamp with time zone NOT NULL,
    PRIMARY KEY (component_version_id, software_license_id)
)

Implementation plan

Using proposal 2.

  • Create migrations
  • Create models and specs for their validation
  • Use separate schema alias
  • Test partitions with this schema by
    • Analyze common select queries and index stats with and without partitions and discuss in MR
Edited by Igor Frenkel