Reconsider the package table structure and design.
Summary
This is meant to be a discussion space concerning the database structure of the package data. Currently, this includes NPM, Maven, Conan, and NuGet. Eventually to also include Composer, PyPi, RubyGems, Cargo, and many more.
Current solution and its problems
The existing solution is structured somewhat like single-table inheritance without imposing any of the constraints that come with that design:
Table "public.packages_packages"
Column | Type | Collation | Nullable | Default
--------------+--------------------------+-----------+----------+-----------------------------------------------
id | bigint | | not null | nextval('packages_packages_id_seq'::regclass)
project_id | integer | | not null |
created_at | timestamp with time zone | | not null |
updated_at | timestamp with time zone | | not null |
name | character varying | | not null |
version | character varying | | |
package_type | smallint | | not null |
Indexes:
"packages_packages_pkey" PRIMARY KEY, btree (id)
"idx_packages_packages_on_project_id_name_version_package_type" btree (project_id, name, version, package_type)
"index_packages_packages_on_name_trigram" gin (name gin_trgm_ops)
"index_packages_packages_on_project_id" btree (project_id)
Foreign-key constraints:
"fk_rails_e1ac527425" FOREIGN KEY (project_id) REFERENCES projects(id) ON DELETE CASCADE
And an example of one of the metadata tables:
Table "public.packages_conan_metadata"
Column | Type | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+-----------------------------------------------------
id | bigint | | not null | nextval('packages_conan_metadata_id_seq'::regclass)
package_id | bigint | | not null |
created_at | timestamp with time zone | | not null |
updated_at | timestamp with time zone | | not null |
package_username | character varying(255) | | not null |
package_channel | character varying(255) | | not null |
Indexes:
"packages_conan_metadata_pkey" PRIMARY KEY, btree (id)
"index_packages_conan_metadata_on_package_id" UNIQUE, btree (package_id)
Foreign-key constraints:
"fk_rails_8c68cfec8b" FOREIGN KEY (package_id) REFERENCES packages_packages(id) ON DELETE CASCADE
Each metadata
table has different attributes, specific to that package type (defined by package.package_type
). So we have models like:
class Packages::Package < ApplicationRecord
belongs_to :project
has_one :conan_metadatum, inverse_of: :package
end
class Packages::ConanMetadatum < ApplicationRecord
belongs_to :package, inverse_of: :conan_metadatum
end
Problems arise when trying to constrain a given package type based on it's own unique naming restrictions. For example, NPM packages are constrained by checking for uniqueness between
package.name
package.version
when package.package_type == 'npm'
, whereas Conan needs to be unique by
package.name
package.version
conan_metadatum.package_username
conan_metadatum.package_channel
where package.package_type == 'conan'
. This prevents us from adding a unique index on name/version/package_type/project_id
in the packages_packages
table, which would benefit NPM and Maven.
Discussion on this topic of adding a unique constraint over two tables for the case of Conan led to taking a step back and questioning "is there a larger underlying problem with the structure of this data".
The structure of the package data has been lightly discussed before, here is a link to a separate issue discussing the intent of the current structure.
Possible solution
Splitting each package type into its own table allows the data for a given package type to be indexed and constrained exactly as it needs to be.
The Packages::Package
model would become an abstract class that the specific package models would implement in order to have a commonly shared interface for things like package.name
, package.version
, package.metadata
.
It would need to allow for behavior like project.packages
to collect all packages regardless of type.
Improvements
We have more control over each package type at the database level, and do not need to worry about overlapping constraints imposed by other package types.
Packages could be treated as a "duck type", where all packages have a common interface, but then when we get to the unique behavior and attributes for each individual package, we don't have to manage keeping them separate in any way.
Risks
Right now we have a project that has many packages, and it does not distinguish between package_type, so fetching all packages is easy. Separating out each type will make this more complex, likely living within an abstraction or module.
We want each package to behave the same, despite having different properties, so it makes sense that we would want a unified "minimal" table/model with all of the shared behavior, and delegate specific behavior to the specific package types.
Involved components
All existing NPM, Maven, Conan, and NuGet models and associated APIs/Services may be affected.
Existing metadata tables could be renamed to be the base table for each package (packages_maven_metadata
becomes packages_maven
), and the name
and version
columns would be added to those tables, migrated from the existing packages_packages
table before it is dropped.
Optional: Intended side effects
Adding new package behavior/constraints does not have any affect on existing package types, and is not prevented by existing package types as the Conan unique index is somewhat prevented.