Skip to content

Optimize the query used to find NPM packages inside a namespace

What does this MR do and why?

The query to get the packages within a namespace is complex and includes certain subqueries twice:

EXPLAIN SELECT "packages_packages".* 
FROM "packages_packages" 
WHERE "packages_packages"."project_id" IN 
  (SELECT "projects"."id" 
  FROM "projects" 
  WHERE "projects"."namespace_id" IN 
    (SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id 
    FROM "namespaces" 
    WHERE "namespaces"."type" = 'Group' 
    AND (traversal_ids @> ('{39}')))
  ) 
AND "packages_packages"."package_type" = 2 
AND "packages_packages"."name" = '@testing-group/hellonpm' 
AND "packages_packages"."status" IN (0, 1) 
AND "packages_packages"."id" IN 
  (SELECT MAX(id) AS id 
  FROM "packages_packages" 
  WHERE "packages_packages"."project_id" IN 
    (SELECT "projects"."id" 
    FROM "projects" 
    WHERE "projects"."namespace_id" IN 
      (SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id 
      FROM "namespaces" 
      WHERE "namespaces"."type" = 'Group' 
      AND (traversal_ids @> ('{39}'))
      )
    ) 
  AND "packages_packages"."package_type" = 2 
  AND "packages_packages"."name" = '@testing-group/hellonpm' 
  AND "packages_packages"."status" IN (0, 1) 
  GROUP BY "packages_packages"."version"
  )

This MR simplifies the query to:

EXPLAIN SELECT "packages_packages".* 
FROM "packages_packages" 
WHERE "packages_packages"."id" IN 
  (SELECT MAX(id) AS id 
  FROM "packages_packages" 
  WHERE "packages_packages"."project_id" IN 
    (SELECT "projects"."id" 
    FROM "projects" 
    WHERE "projects"."namespace_id" IN 
      (SELECT namespaces.traversal_ids[array_length(namespaces.traversal_ids, 1)] AS id 
      FROM "namespaces" 
      WHERE "namespaces"."type" = 'Group' 
      AND (traversal_ids @> ('{39}')))) 
  AND "packages_packages"."package_type" = 2 
  AND "packages_packages"."name" = '@testing-group/hellonpm' 
  AND "packages_packages"."status" IN (0, 1) 
  GROUP BY "packages_packages"."version")

The query to get the packages within a project is also changed similarly.

Original query:

SELECT "packages_packages".* 
FROM "packages_packages" 
WHERE "packages_packages"."project_id" = 10 
AND "packages_packages"."package_type" = 2 
AND "packages_packages"."name" = 'foofoofoo' 
AND "packages_packages"."status" IN (0, 1) 
AND "packages_packages"."id" IN 
  (SELECT MAX(id) AS id 
  FROM "packages_packages" 
  WHERE "packages_packages"."project_id" = 10 
  AND "packages_packages"."package_type" = 2 
  AND "packages_packages"."name" = 'foofoofoo' 
  AND "packages_packages"."status" IN (0, 1) 
  GROUP BY "packages_packages"."version")

New query:

SELECT "packages_packages".* 
FROM "packages_packages" 
WHERE "packages_packages"."id" IN 
  (SELECT MAX(id) AS id 
  FROM "packages_packages" 
  WHERE "packages_packages"."project_id" = 10 
  AND "packages_packages"."package_type" = 2 
  AND "packages_packages"."name" = 'foofoofoo' 
  AND "packages_packages"."status" IN (0, 1) 
  GROUP BY "packages_packages"."version")

Context

We discovered a query performance issue during the rollout of a bugfix. We weren't encountering this prior to the bugfix, because the bugfix was only querying for packages inside one project. The bugfix now correctly queries for packages across all projects within the namespace, but it turned out that this was an expensive query.

The MR is behind the same feature flag, npm_allow_packages_in_multiple_projects as the bugfix because the slow query only gets run when the bugfix is in place.

Screenshots or screen recordings

Performance analysis of instance scope queries

Original query, instance scope: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/16411/commands/56362

Time: 16.296 s
  - planning: 12.026 ms
  - execution: 16.284 s
    - I/O read: 13.346 s
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 228713 (~1.70 GiB) from the buffer pool
  - reads: 6547 (~51.10 MiB) from the OS file cache, including disk I/O
  - dirtied: 297 (~2.30 MiB)
  - writes: 0

New query, instance scope: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/16416/commands/56370

Time: 1.694 s
  - planning: 10.677 ms
  - execution: 1.683 s
    - I/O read: 1.599 s
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 9895 (~77.30 MiB) from the buffer pool
  - reads: 6543 (~51.10 MiB) from the OS file cache, including disk I/O
  - dirtied: 297 (~2.30 MiB)
  - writes: 0

We don't have performance analysis data for project scope queries as these queries are not causing the SQL timeout errors we've seen for the instance scope query.

How to set up and validate locally

We can follow the same validation for this bugfix which is behind the same feature flag.

  1. Setup
  • create a group named mygroup
  • create a project named project1 under mygroup with public visibility
  • create a project named project2 under mygroup with public visibility
  • publish an npm package named @mygroup/foo version 1.0.0 under project1
  • publish an npm package named @mygroup/foo version 2.0.0 under project2
  • create a personal access token
  • add the instance endpoint with the personal access token to .npmrc
  1. With the :npm_allow_packages_in_multiple_projects feature flag disabled:

npm i @mygroup/foo@2.0.0 installs successfully

npm i @mygroup/foo@1.0.0 fails with a "No matching version found" response

  1. With the :npm_allow_packages_in_multiple_projects feature flag enabled:

npm i @mygroup/foo@2.0.0 installs successfully

npm i @mygroup/foo@1.0.0 installs successfully

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #395745

Edited by Radamanthus Batnag

Merge request reports