Add package metadata ingestion service

Igor Frenkel requested to merge 383723-package-metadata-ingestion-service into master

What does this MR do and why?

Add a service for inserting package metadata into the Rails database.

Related to: #383723 (closed)

This issue is part of Sync Rails backend with License DB (&9349 - closed) which is in turn a sub-epic of Replace license-finder MVC (&8072 - closed).

PackageMetadata::SyncService is responsible for calling this service with a list of data to be imported into the database's pm_ tables. This service is responsible for performing the bulk upserts idempotently, so running it again with the same data does not create duplicate rows. It uses BulkInsertableTask to do so.

Note

The tables for the models being populated are currently empty. This is the MR that adds the functionality to populate them.

How to test this MR

Run the service in a Rails console and check that all dependent models are updated. The inserts should run inside a transaction.

bundle exec rails c
PackageMetadata::Ingestion::IngestionService.execute([PackageMetadata::DataObject.new('package-1','v1.0.0','Apache license','composer')]) 

The output should look something like this:

[1] pry(main)> PackageMetadata::Ingestion::IngestionService.execute([PackageMetadata::DataObject.new('foo','v1','mit','composer')]) 
  TRANSACTION (0.1ms)  BEGIN /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/lib/gitlab/database/schema_cache_with_renamed_table.rb:25:in `columns'*/
  #<Class:0x0000000130feda90> Upsert (0.3ms)  INSERT INTO "pm_packages" ("purl_type","name","created_at","updated_at") VALUES (1, 'foo', '2023-02-06 19:06:30.374401', '2023-02-06 19:06:30.374401') ON CONFLICT ("purl_type","name") DO UPDATE SET "updated_at"=excluded."updated_at" RETURNING "id","purl_type","name" /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/app/models/concerns/bulk_insert_safe.rb:163:in `block (2 levels) in _bulk_insert_all!'*/
  #<Class:0x00000001301cffd0> Upsert (0.2ms)  INSERT INTO "pm_package_versions" ("pm_package_id","version","created_at","updated_at") VALUES (697123, 'v1', '2023-02-06 19:06:30.469499', '2023-02-06 19:06:30.469499') ON CONFLICT ("pm_package_id","version") DO UPDATE SET "updated_at"=excluded."updated_at" RETURNING "id","pm_package_id","version" /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/app/models/concerns/bulk_insert_safe.rb:163:in `block (2 levels) in _bulk_insert_all!'*/
  #<Class:0x0000000135373e90> Upsert (0.2ms)  INSERT INTO "pm_licenses" ("spdx_identifier","created_at","updated_at") VALUES ('mit', '2023-02-06 19:06:30.476160', '2023-02-06 19:06:30.476160') ON CONFLICT ("spdx_identifier") DO UPDATE SET "updated_at"=excluded."updated_at" RETURNING "id","spdx_identifier" /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/app/models/concerns/bulk_insert_safe.rb:163:in `block (2 levels) in _bulk_insert_all!'*/
  #<Class:0x000000013541cae0> Upsert (0.2ms)  INSERT INTO "pm_package_version_licenses" ("pm_package_version_id","pm_license_id","created_at","updated_at") VALUES (3263720, 1138230, '2023-02-06 19:06:30.481580', '2023-02-06 19:06:30.481580') ON CONFLICT ("pm_package_version_id","pm_license_id") DO UPDATE SET "updated_at"=excluded."updated_at" /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/app/models/concerns/bulk_insert_safe.rb:163:in `block (2 levels) in _bulk_insert_all!'*/
  TRANSACTION (0.1ms)  COMMIT /*application:console,db_config_name:main,console_hostname:foo-glmbp-m1,console_username:foo,line:/lib/gitlab/database.rb:375:in `commit'*/
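
The order of the four upserts in the log above matters: each later statement uses ids returned by an earlier one. The following is a minimal plain-Ruby sketch of that flow, not the service's actual implementation. `FakeTable`, `IngestionSketch`, and the `DataObject` field names are hypothetical stand-ins, and `purl_type` is kept as a string here, although the log shows it stored as an integer.

```ruby
# Hypothetical in-memory stand-in for a table with a unique index.
class FakeTable
  attr_reader :rows

  def initialize(unique_by)
    @unique_by = unique_by
    @rows = {} # unique key values => row hash
    @next_id = 1
  end

  # Mimics INSERT ... ON CONFLICT (unique_by) DO UPDATE
  # SET updated_at = excluded.updated_at RETURNING ...
  def upsert_all(records, now: Time.now)
    records.map do |attrs|
      key = attrs.values_at(*@unique_by)
      row = @rows[key]
      if row
        row[:updated_at] = now # conflict: only the timestamp changes
      else
        row = attrs.merge(id: @next_id, created_at: now, updated_at: now)
        @next_id += 1
        @rows[key] = row
      end
      row
    end
  end
end

# Field names are assumed from the positional arguments used in the console example.
DataObject = Struct.new(:name, :version, :license, :purl_type)

class IngestionSketch
  attr_reader :packages, :versions, :licenses, :version_licenses

  def initialize
    @packages = FakeTable.new(%i[purl_type name])
    @versions = FakeTable.new(%i[pm_package_id version])
    @licenses = FakeTable.new(%i[spdx_identifier])
    @version_licenses = FakeTable.new(%i[pm_package_version_id pm_license_id])
  end

  def execute(data)
    # 1. Packages first; remember the returned ids.
    pkgs = @packages.upsert_all(data.map { |d| { purl_type: d.purl_type, name: d.name } }.uniq)
    pkg_ids = pkgs.to_h { |r| [[r[:purl_type], r[:name]], r[:id]] }

    # 2. Versions need the package ids.
    vers = @versions.upsert_all(
      data.map { |d| { pm_package_id: pkg_ids[[d.purl_type, d.name]], version: d.version } }.uniq
    )
    ver_ids = vers.to_h { |r| [[r[:pm_package_id], r[:version]], r[:id]] }

    # 3. Licenses are independent of packages.
    lics = @licenses.upsert_all(data.map { |d| { spdx_identifier: d.license } }.uniq)
    lic_ids = lics.to_h { |r| [r[:spdx_identifier], r[:id]] }

    # 4. The join rows need both version ids and license ids.
    @version_licenses.upsert_all(
      data.map do |d|
        version_id = ver_ids[[pkg_ids[[d.purl_type, d.name]], d.version]]
        { pm_package_version_id: version_id, pm_license_id: lic_ids[d.license] }
      end.uniq
    )
  end
end

sketch = IngestionSketch.new
data = [DataObject.new('foo', 'v1', 'mit', 'composer')]
2.times { sketch.execute(data) } # the second run only touches updated_at
puts sketch.packages.rows.size
```

Running the sketch twice with the same data leaves one row per table, which is the idempotent behaviour described above.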

Some notable changes

Migration to add timestamps to pm_ tables

There are several migrations that add timestamps to the underlying model tables. This is needed because the underlying ActiveRecord insert_all functionality converts on_conflict update calls to on conflict ignore. This is a peculiarity of ActiveRecord::InsertAll, and after several attempts to work around it, we decided to bite the bullet and add timestamps. More information here: !108600 (comment 1234620616)
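
To illustrate why the difference between the two conflict modes matters, here is a small plain-Ruby simulation. It relies on one PostgreSQL fact: with ON CONFLICT DO NOTHING, the RETURNING clause yields only rows that were actually inserted, whereas DO UPDATE also returns the rows it updated. Whether this is the exact reason behind the change in this MR is an assumption; the `FakeLicenses` class and its method names are hypothetical and do not exist in the codebase.

```ruby
# Hypothetical in-memory model of pm_licenses with a unique spdx_identifier.
class FakeLicenses
  def initialize
    @rows = {} # spdx_identifier => row hash
    @next_id = 1
  end

  # Mimics ON CONFLICT DO NOTHING ... RETURNING id:
  # rows that already exist are skipped and NOT returned.
  def insert_ignore(spdx_identifiers)
    spdx_identifiers.reject { |spdx| @rows.key?(spdx) }.map { |spdx| create(spdx) }
  end

  # Mimics ON CONFLICT DO UPDATE SET updated_at = excluded.updated_at ... RETURNING id:
  # every input row comes back, whether inserted or updated.
  def upsert(spdx_identifiers, now: Time.now)
    spdx_identifiers.map do |spdx|
      row = @rows[spdx] || create(spdx)
      row[:updated_at] = now
      row
    end
  end

  private

  def create(spdx)
    row = { id: @next_id, spdx_identifier: spdx, updated_at: Time.now }
    @next_id += 1
    @rows[spdx] = row
  end
end

licenses = FakeLicenses.new
licenses.upsert(['mit'])

# 'mit' already exists, so the "ignore" variant returns only the new row.
ignored = licenses.insert_ignore(%w[mit apache-2.0])
puts ignored.map { |r| r[:spdx_identifier] }.inspect

# The "update" variant returns ids for both, so dependent rows can be built.
upserted = licenses.upsert(%w[mit apache-2.0])
puts upserted.map { |r| r[:id] }.inspect
```

With a timestamp column there is always something to update on conflict, so the upsert can stay in the "update" mode and return the ids needed by the dependent tables.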

Migration to add id column

There is also a change to add an id column to pm_package_version_licenses. Because Rails doesn't recognize compound primary keys, things become more difficult than they have to be (especially in factories). For example, calling PackageVersionLicense.create without a single primary key column raises an error, because Rails assumes that there is always a single primary key column to return (it generates a RETURNING id clause).

Migration to change default of null on column

The pm_package_version_licenses table had a default: null column on pm_package_id, which is incorrect for a join table and also prevents upserts from working properly.

has_many relationships added to several models

These associations were not previously defined. They are not needed for bulk inserts, but they ensure that factories work correctly in specs (e.g. associations work).

Change to factories and to ee/spec/lib/gitlab/license_scanning/package_licenses_spec.rb

The factories are refactored to make them more generic, so that they can be used both by the specs in this MR and by package_licenses_spec.rb.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Igor Frenkel