Ingest advisory and affected package data to DB

What does this MR do and why?

This MR is similar to Add package metadata ingestion for version form... (!120027 - merged). However, instead of ingesting license data, it ingests advisory and affected_package data.

Two tables are touched during ingestion: pm_advisories and pm_affected_packages.

  1. Advisory data is collected from the slice of objects passed to the ingestion service and upserted into pm_advisories. The advisory_xid and source_xid keys are used to determine whether to insert a new record or update an existing one.
  2. An advisory_map of advisory_xid => advisory database id is built for each record upserted.
  3. Each advisory might have multiple affected packages, which we loop through and upsert into the pm_affected_packages table. Each pm_affected_packages record is linked to its parent advisory by setting the pm_affected_packages.pm_advisory_id value using the advisory_map from step 2.
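
The three steps above can be sketched in plain Ruby. This is only an illustration under stated assumptions, not the code in this MR: FakeTable is a hypothetical in-memory stand-in for a database table, the ingest method and the shape of the input hashes are invented for the example, the foreign key is assumed to be named pm_advisory_id, and the unique key used for affected packages is a guess.

```ruby
# Hypothetical in-memory stand-in for a table with a unique index.
# The real ingestion upserts into PostgreSQL; this only mimics the semantics.
class FakeTable
  attr_reader :rows

  def initialize(unique_by)
    @unique_by = unique_by # columns that identify an existing row
    @rows = {}             # unique key values => row hash
    @next_id = 1
  end

  # Insert the row, or update it when a row with the same unique key exists.
  # Returns the database id of the affected row.
  def upsert(attrs)
    key = attrs.values_at(*@unique_by)
    if (row = @rows[key])
      row.merge!(attrs)
    else
      row = @rows[key] = attrs.merge(id: @next_id)
      @next_id += 1
    end
    row[:id]
  end
end

# Hypothetical entry point; data_objects is an array of hashes.
def ingest(data_objects, advisories:, affected_packages:)
  # Steps 1 and 2: upsert each advisory and build advisory_xid => database id.
  advisory_map = data_objects.to_h do |obj|
    id = advisories.upsert(
      advisory_xid: obj[:advisory_xid],
      source_xid: obj[:source_xid],
      title: obj[:title]
    )
    [obj[:advisory_xid], id]
  end

  # Step 3: upsert each affected package, linked to its parent via the map.
  data_objects.each do |obj|
    parent_id = advisory_map.fetch(obj[:advisory_xid])
    obj[:affected_packages].each do |pkg|
      affected_packages.upsert(pkg.merge(pm_advisory_id: parent_id))
    end
  end

  advisory_map
end

advisories = FakeTable.new(%i[advisory_xid source_xid])
# Assumed unique key for affected packages; the real index may differ.
affected_packages = FakeTable.new(%i[pm_advisory_id purl_type package_name distro_version])

data = [
  { advisory_xid: 'adv-1', source_xid: 'glad', title: 'Example advisory',
    affected_packages: [
      { purl_type: 'npm', package_name: 'example-pkg', distro_version: '', affected_range: '<1.0.0' }
    ] }
]

ingest(data, advisories: advisories, affected_packages: affected_packages)
# Running the same data again updates the existing rows instead of adding new ones.
ingest(data, advisories: advisories, affected_packages: affected_packages)
```

Because both steps are upserts keyed on a unique tuple, re-running a sync over the same export should leave the row counts unchanged.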

Database changes

This MR updates the pm_affected_packages.distro_version column to DEFAULT NOT NULL as explained in this comment.

Characteristics of ingested data

Initially, we'll only be supporting the gemnasium-db as a data source for advisories. The current size of the exported advisory data is around 30MB:

$ gsutil -m rsync -r -d gs://prod-export-advisory-bucket-1a6c642fc4de57d4 $GITLAB_RAILS_ROOT_DIR/vendor/package_metadata/advisories

$ du -h $GITLAB_RAILS_ROOT_DIR/vendor/package_metadata/advisories

5.0M	vendor/package_metadata/advisories/v2/pypi
2.8M	vendor/package_metadata/advisories/v2/go
7.7M	vendor/package_metadata/advisories/v2/maven
2.4M	vendor/package_metadata/advisories/v2/nuget
4.5M	vendor/package_metadata/advisories/v2/packagist
716K	vendor/package_metadata/advisories/v2/conan
5.1M	vendor/package_metadata/advisories/v2/npm
2.1M	vendor/package_metadata/advisories/v2/rubygem
 30M	vendor/package_metadata/advisories

Eventually, we'll support other sources of advisory data, such as trivy-db-glad, which is around 360MB:

$ oras pull registry.gitlab.com/gitlab-org/security-products/dependencies/trivy-db-glad:2
$ tar -xzf db.tar.gz
$ ls -alh trivy.db
-rw-------  1 adam  wheel   361M Jul 10 13:11 trivy.db

How to set up and validate locally

  1. Create a new directory for advisories in $GITLAB_RAILS_ROOT_DIR/vendor/package_metadata/advisories:

    mkdir -p $GITLAB_RAILS_ROOT_DIR/vendor/package_metadata/advisories

  2. Install the gsutil tool.

  3. Sync package advisory bucket using gsutil:

    gsutil -m rsync -r -d gs://prod-export-advisory-bucket-1a6c642fc4de57d4 $GITLAB_RAILS_ROOT_DIR/vendor/package_metadata/advisories

  4. Open the Rails console and start the sync process:

    PM_SYNC_IN_DEV=true rails c
    
    [1] pry(main)> Feature.enable(:package_metadata_advisory_sync)
    
    [2] pry(main)> module PackageMetadata
      class MyAdvisoriesSyncWorker
        include ExclusiveLeaseGuard
    
        def lease_timeout
          5.minutes
        end
    
        def perform
          try_obtain_lease do
            SyncService.execute(data_type: 'advisories', lease: exclusive_lease)
          end
        end
      end
    end
    
    [3] pry(main)> PackageMetadata::MyAdvisoriesSyncWorker.new.perform

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #406836 (closed)

Edited by Adam Cohen
