Pre-generate package dependencies

What does this MR do and why?

The aim of this MR is to remove the loop over Packages::DependencyLink records when generating a package's metadata.
Why? When a package has 1k versions and each version has about 100 dependencies, we have to loop over 100k dependency links. If we avoid loading every dependency link for every package and instead load aggregated data (dependency ids grouped by dependency type), metadata generation, and the metadata endpoint in general, should become significantly faster.
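
To illustrate the grouping idea, here is a minimal sketch in plain Ruby. The LinkRow struct and the sample rows are hypothetical stand-ins for Packages::DependencyLink records, and the aggregation is done in memory purely for illustration; it is not the query this MR uses.

```ruby
# Hypothetical stand-in for a dependency link row; not the real model.
LinkRow = Struct.new(:package_id, :dependency_id, :dependency_type)

# Made-up sample rows.
rows = [
  LinkRow.new(1, 10, 'dependencies'),
  LinkRow.new(1, 11, 'devDependencies'),
  LinkRow.new(2, 10, 'dependencies'),
  LinkRow.new(2, 12, 'dependencies')
]

# Aggregate once: package id => { dependency type => [dependency ids] }.
grouped = rows.each_with_object({}) do |row, acc|
  by_type = (acc[row.package_id] ||= {})
  (by_type[row.dependency_type] ||= []) << row.dependency_id
end
```

After this single pass, each package needs only a hash lookup instead of a scan over its link records.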

So this MR introduces two hashes: dependencies and dependency_ids.
The first, dependencies, holds dependencies with the required attributes and acts as a cache between batches of packages. It looks like:
<dependency id> : { <dependency name> : <dependency version_pattern> }
The second, dependency_ids, keeps the relation between a package and its dependencies and looks like:
<package id> : { <dependency type> => [<dependency 1 id>, <dependency 2 id>, ...], ... }

Then, when generating a package's metadata, we can use those two hashes to build the package's dependencies and avoid looping through its dependency links.
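
A minimal sketch of that lookup, assuming the two hash shapes described above. The ids, names, and version patterns are made up, and build_package_dependencies is a hypothetical helper, not the method used in the service.

```ruby
# Illustrative data only.
# <dependency id> => { <dependency name> => <dependency version_pattern> }
dependencies = {
  10 => { 'lodash' => '^4.17.0' },
  11 => { 'jest' => '^29.0.0' },
  12 => { 'express' => '~4.18.2' }
}

# <package id> => { <dependency type> => [<dependency id>, ...] }
dependency_ids = {
  1 => { 'dependencies' => [10], 'devDependencies' => [11] },
  2 => { 'dependencies' => [10, 12] }
}

# Hypothetical helper: for each dependency type, merge the cached
# name => version_pattern pairs of the referenced dependency ids.
def build_package_dependencies(package_id, dependencies, dependency_ids)
  dependency_ids.fetch(package_id, {}).transform_values do |ids|
    ids.each_with_object({}) { |id, acc| acc.merge!(dependencies.fetch(id, {})) }
  end
end

result = build_package_dependencies(2, dependencies, dependency_ids)
```

Here result maps each dependency type of package 2 to a name => version_pattern hash, built without touching any dependency link records.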

Screenshots or screen recordings

Benchmarks

To benchmark the service I prepared the following data locally:

# 2723 package versions. Yes, this is a real case.
# 348545 package dependency links.
# 129 package dependencies for every package version.
# generate_metadata_ips.rb

require 'benchmark/ips'
require_relative 'config/environment'

Benchmark.ips do |x|
  x.report('Packages::Npm::GenerateMetadataService#execute') do
    name = 'XXX'
    packages = Packages::Package.where(name: name)
    Packages::Npm::GenerateMetadataService.new(name, packages).execute
  end
end

Before

➜  gitlab git:(392448-generate-p...) ✗ ruby generate_metadata_ips.rb 
Warming up --------------------------------------
Packages::Npm::GenerateMetadataService#execute
                         1.000  i/100ms
Calculating -------------------------------------
Packages::Npm::GenerateMetadataService#execute
                          0.156  (± 0.0%) i/s -      1.000  in   6.425440s

After

➜  gitlab git:(392448-generate-p...) ✗ ruby generate_metadata_ips.rb 
Warming up --------------------------------------
Packages::Npm::GenerateMetadataService#execute
                         1.000  i/100ms
Calculating -------------------------------------
Packages::Npm::GenerateMetadataService#execute
                          1.284  (± 0.0%) i/s -      7.000  in   5.527915s
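
For a rough reading of the two runs, the i/s figures can be converted into a speedup factor and a time per call. This is only arithmetic on the numbers reported above; the variable names are illustrative.

```ruby
before_ips = 0.156 # iterations per second, from the "Before" run
after_ips  = 1.284 # iterations per second, from the "After" run

speedup        = after_ips / before_ips
seconds_before = 1.0 / before_ips
seconds_after  = 1.0 / after_ips

puts format('speedup: %.1fx, %.2fs -> %.2fs per call',
            speedup, seconds_before, seconds_after)
# prints "speedup: 8.2x, 6.41s -> 0.78s per call"
```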

I wasn't fully confident in the benchmarks, so I also created screen recordings:

before

The request ends with a Timeout error after 60s.

after

Quite fast 🚀

How to set up and validate locally

  1. The feature is behind a feature flag, so the first step is to enable it:

    Feature.enable(:npm_optimize_metadata_generation)
  2. Create a package with dependencies:

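    # Console shim for the fixture_file_upload helper that the package factory needs.
    def fixture_file_upload(*args, **kwargs)
      Rack::Test::UploadedFile.new(*args, **kwargs)
    end

    # Create an npm package named 'test' in the first project.
    p = FactoryBot.create(:npm_package, project: Project.first, name: 'test')

    # Create a dependency and link it to the package.
    FactoryBot.create(:packages_dependency) do |d|
      FactoryBot.create(:packages_dependency_link, package: p, dependency: d)
    end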
  3. Query the package's metadata:

    $ curl --header "PRIVATE-TOKEN: <PAT>" "http://gdk.test:3000/api/v4/projects/<project_id>/packages/npm/test"

    The server should return the generated package metadata.

Database analysis

For all database query analysis I used an existing package that has 2742 versions, 326399 dependency links, and 129 dependencies.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #392448 (closed)

Edited by Dzmitry (Dima) Meshcharakou