Skip to content

NPM package registry: fine tune a query for the instance level

David Fernandez requested to merge 339835-tweak-npm-package-finder into master

👽 Context

The NPM registry can be accessed at two "levels":

  • The project level
  • The instance level

At the instance level, only packages that follow the naming convention are available.

How do we locate the correct package? Scanning all the NPM packages for a given GitLab instance would go through way too many packages. What we do is that we leverage the naming convention to find the project hosting the packages.

  1. We extract the package scope.
  2. From that scope, we locate the namespace.
  3. We execute a package finder for that namespace.
  4. We look at the most recent package and read the project_id. We just reduced an instance access level to a project access level.
    • Accessing packages within a project scope is obviously way more efficient than trying to scan all existing npm packages.

These steps are nicely packed here.

The 💥 is that the finder uses this scope. The scope is before my time but I'm assuming that during the early versions of the npm registry, packages were duplicated which means that in the Packages::Package table we could find two npm packages with the same name and same version. That scope ensures that only the most recent row is used.

Looking at the source of that scope, we can see that all is called in a subquery. The problem is that all will basically use the current conditions attached to the main query to use them in the subquery = the conditions will appear twice.

Back to step (3.) above, going through all the packages of a given root namespace can be expensive. Some namespaces have a high number of nested sub groups and basically, the finder will need to use the hierarchy traversal conditions. See https://gitlab.com/gitlab-org/gitlab/-/blob/7eac953926529288a5a332c18a058e8dcc4efb61/app/models/namespace.rb#L283. Those conditions efficiency depends on the number of objects nested in the root namespace.

Now combine both points and what is the result? This. By reusing a condition that is hard to execute twice, we get an overall slow query:

Query
SELECT "packages_packages".*
  FROM "packages_packages"
 WHERE "packages_packages"."project_id" IN (
        SELECT "projects"."id"
          FROM "projects"
         WHERE "projects"."namespace_id" IN (
                SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                  FROM "namespaces"
                 WHERE (traversal_ids @> ('{XXX}'))
               )
       )
   AND "packages_packages"."package_type" = 2
   AND "packages_packages"."name" = 'XXX'
   AND "packages_packages"."status" = 0
   AND "packages_packages"."id" IN (
        SELECT MAX(id) AS id
          FROM "packages_packages"
         WHERE "packages_packages"."project_id" IN (
                SELECT "projects"."id"
                  FROM "projects"
                 WHERE "projects"."namespace_id" IN (
                        SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                          FROM "namespaces"
                         WHERE (traversal_ids @> ('{XXX}'))
                       )
               )
           AND "packages_packages"."package_type" = 2
           AND "packages_packages"."name" = 'XXX'
           AND "packages_packages"."status" = 0
         GROUP BY "packages_packages"."version"
       )
 ORDER BY "packages_packages"."id" DESC
 LIMIT 1;

This is issue #339835 (closed).

🔧 Solution

How can we improve things?

Looking at the query above, we see that rails is smart enough to compact the scope + .last + &.project_id in a single query. In other words, the query is already returning the project id linked to the most recent package that has the name that we look for.

Because, we're only interested in the most recent package, we don't need to use the scope at all. Why handle duplicated packages when we're basically using the most recent package? This means that we don't need the .last_of_each_version scope for this call.

The finder being used in multiple locations, we can't simply drop the scope from the finder. We need to update it so that it accepts an option that let clients of this finder control if they want to use the scope or not.

Lastly, because the npm package registry is one of the most used on gitlab.com, the changes here are going to be deployed behind a feature flag. This is to gives us a chance to rollout the fix incrementally and to have a safety net to fall back to the old implementation quickly if an issue occurs.

🤔 What does this MR do?

  • Update Packages::Npm::PackageFinder to accept last_of_each_version option. That option will be enabled by default.
  • Update the instance level access to don't use that option when searching for the project id.
  • Gate those changes behind a feature flag: npm_finder_query_avoid_duplicated_conditions
  • Update the related specs

📸 Screenshots or Screencasts (strongly suggested)

Here is the original query:

Original query
SELECT "packages_packages".*
  FROM "packages_packages"
 WHERE "packages_packages"."project_id" IN (
        SELECT "projects"."id"
          FROM "projects"
         WHERE "projects"."namespace_id" IN (
                SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                  FROM "namespaces"
                 WHERE (traversal_ids @> ('{XXX}'))
               )
       )
   AND "packages_packages"."package_type" = 2
   AND "packages_packages"."name" = 'XXX'
   AND "packages_packages"."status" = 0
   AND "packages_packages"."id" IN (
        SELECT MAX(id) AS id
          FROM "packages_packages"
         WHERE "packages_packages"."project_id" IN (
                SELECT "projects"."id"
                  FROM "projects"
                 WHERE "projects"."namespace_id" IN (
                        SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                          FROM "namespaces"
                         WHERE (traversal_ids @> ('{XXX}'))
                       )
               )
           AND "packages_packages"."package_type" = 2
           AND "packages_packages"."name" = 'XXX'
           AND "packages_packages"."status" = 0
         GROUP BY "packages_packages"."version"
       )
 ORDER BY "packages_packages"."id" DESC
 LIMIT 1;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6369/commands/21713

Here is the query without using the scope:

Query without scope
SELECT "packages_packages".*
  FROM "packages_packages"
 WHERE "packages_packages"."project_id" IN (
        SELECT "projects"."id"
          FROM "projects"
         WHERE "projects"."namespace_id" IN (
                SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id
                  FROM "namespaces"
                 WHERE (traversal_ids @> ('{XXX}'))
               )
       )
   AND "packages_packages"."package_type" = 2
   AND "packages_packages"."name" = 'XXX'
   AND "packages_packages"."status" = 0
 ORDER BY "packages_packages"."id" DESC
 LIMIT 1;

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6393/commands/21819

As wanted, the subquery (with the MAX) has been removed.

It's a bit hard to evaluate the impact of one less subquery as the conditions for group hierarchy traversals behave a bit strangely on pg.ai. Sometimes they are fast and other times they are slower. We still believe that we should have better execution times without the subquery.

How to setup and validate locally (strongly suggested)

Group hierarchy traversals seems to have several improvements in progress. To get the query that we have on production, we need to enable both feature flag:

  • use_traversal_ids
  • recursive_approach_for_all_projects

Please note that no matter what is the form of the group hierarchy traversal condition, this MR will still remove a duplicated condition.

  1. Have a user with a personal access token ready (scope api)
  2. Create a group
  3. Create a project inside that group
  4. With gl_pru, upload 3-4 packages but be careful to use this form for the package name:
    $ bundle exec thor package:push --package-type=npm --user=<username> --token=<pat_token> --url=<gitlab_base_url>/api/v4/projects/<project_id>/packages/npm/ --name=test --version=1.3.8 --npm-package-scope=@<group_path>
  5. In a rails console:
    Packages::Npm::PackageFinder.new('@<group_path>/test', namespace: Namespace.top_most.by_path('<group_path>'), last_of_each_version: true)
    Packages::Npm::PackageFinder.new('@<group_path>/test', namespace: Namespace.top_most.by_path('<group_path>'), last_of_each_version: false)

When last_of_each_version is set to true, it will give you the query currently used on gitlab.com.

When last_of_each_version is set to false, it will give you the query as modified by this MR with the feature flag is enabled.

📐 Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

💿 Database review

Edited by David Fernandez

Merge request reports