Skip to content

Add hidden status to packages

Steve Abrams requested to merge 290741-packages-hidden-status into master

What does this MR do?

Adds a new column status to the packages_packages table. The column is an enum that will hold various statuses that a given package could have. To start with, we are adding the :default, :hidden, and :processing statuses.

Only :default packages are returned in the API index endpoints and in the list pages of the UI (populated from these APIs) unless other statuses are specifically specified using the status param.

The first package type to make use of this new column and :hidden status is the Generic Package type. This MR allows the status param to be used when creating Generic packages.

The processing status is not yet used in this MR, but will be used in the implementation of RubyGems packages (currently in progress) when we process the incoming gem and extract metadata.

Database

Migrations

Up migrations

== 20210204152257 AddStatusToPackagesPackages: migrating ======================
-- add_column(:packages_packages, :status, :smallint, {:default=>0, :null=>false})
   -> 0.0073s
== 20210204152257 AddStatusToPackagesPackages: migrated (0.0075s) =============

== 20210209171525 AddStatusIndexToPackagesPackages: migrating =================
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:packages_packages, [:project_id, :status], {:name=>"index_packages_packages_on_project_id_and_status", :algorithm=>:concurrently})
   -> 0.0103s
-- execute("SET statement_timeout TO 0")
   -> 0.0098s
-- add_index(:packages_packages, [:project_id, :status], {:name=>"index_packages_packages_on_project_id_and_status", :algorithm=>:concurrently})
   -> 0.1259s
-- execute("RESET ALL")
   -> 0.0011s
== 20210209171525 AddStatusIndexToPackagesPackages: migrated (0.2294s) ========

Down migrations

== 20210209171525 AddStatusIndexToPackagesPackages: reverting =================
-- transaction_open?()
   -> 0.0000s
-- indexes(:packages_packages)
   -> 0.0056s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- remove_index(:packages_packages, {:algorithm=>:concurrently, :name=>"index_packages_packages_on_project_id_and_status"})
   -> 0.0028s
-- execute("RESET ALL")
   -> 0.0005s
== 20210209171525 AddStatusIndexToPackagesPackages: reverted (0.0103s) ========

== 20210204152257 AddStatusToPackagesPackages: reverting ======================
-- remove_column(:packages_packages, :status, :smallint, {:default=>0, :null=>false})
   -> 0.0066s
== 20210204152257 AddStatusToPackagesPackages: reverted (0.0209s) =============

Scope queries:

Both the packages_finder and group_packages_finder will always apply either the displayable scope, or with_status, and will always be looking at packages within a given project_id or set of project_ids. Most packages will have the default status: status = 0, so adding an index where status = 0 does not really improve the query time since we will still have a large number of packages to scan or iterate over to apply any other filters. Adding a general index on [:project_id, :status] does improve queries for non-default statuses. The finders paginate with a limit of 20, these example queries are the "base query" of the finder (no params included). This is the query that will run if you visit the package registry page for a project or group. Adding params to filter further, such as by package name, will result in using existing indexes such as (project_id, package_type) and (project_id, name, version, package_type).

This thread also contains some details of how various other indexes were tried and how we ended up adding (project_id, status).

GroupPackagesFinder

This finder contains a pre-existing low performing query. See #321590, which has been opened to look into investigating and optimizing this query.

Here we look at a group that contains a project with a large number of packages (20000+) to get an idea of the timings on the higher end.

Base query:

SELECT "packages_packages".*
FROM "packages_packages"
WHERE "packages_packages"."project_id" IN
    (SELECT "projects"."id"
     FROM "projects"
     WHERE "projects"."namespace_id" IN
         (WITH RECURSIVE "base_and_descendants" AS (
                                                      (SELECT "namespaces".*
                                                       FROM "namespaces"
                                                       WHERE "namespaces"."id" = 1)
                                                    UNION
                                                      (SELECT "namespaces".*
                                                       FROM "namespaces",
                                                            "base_and_descendants"
                                                       WHERE "namespaces"."parent_id" = "base_and_descendants"."id")) SELECT "id"
          FROM "base_and_descendants" AS "namespaces") )
  AND ("packages_packages"."package_type" != 4
       OR "packages_packages"."name" != 'NuGet.Temporary.Package')
  AND "packages_packages"."status" = 0
ORDER BY created_at ASC
LIMIT 20 OFFSET 0
Where status = 0 (default):

The index does not effect when status = 0

Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/2270/commands/7037

Where status = 1:

Explain before: https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/2261/commands/7024

Explain after: https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/2270/commands/7051

PackagesFinder

The PackagesFinder works within the scope of a single project. Most of the time these queries are within the performance guidelines as most projects have less than 1000 packages, only a few have greater than 10000. To see the extreme case here and how an index may or may not help, we look at a project that has ~20000 packages with a few thousand updated to non-default statuses.

Base query:

SELECT "packages_packages".*
FROM "packages_packages"
WHERE "packages_packages"."project_id" = 15288597
  AND ("packages_packages"."package_type" != 4
       OR "packages_packages"."name" != 'NuGet.Temporary.Package')
  AND "packages_packages"."status" = 0 -- this is the aspect of the query that has changed
ORDER BY created_at ASC
LIMIT 20 OFFSET 0
With status = 0 (default)

The index does not effect when status = 0

Explain: https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/2270/commands/7041

With status = 1

Explain before: - https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/2270/commands/7043

Explain after: - https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/2270/commands/7055

Since we are sorting by created_at by default, when we have a large enough number of packages of varying statuses, the planner will not use the new index, but will use the (project_id, created_at) index instead. If we only have a few packages with a non-default package, the planner will make the choice to use the new index, resulting in a faster query:

Explain with smaller non-default packages and new index: https://console.postgres.ai/gitlab/gitlab-production-tunnel/sessions/2304/commands/7150

Screenshots (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

Related to #290741 (closed)

Edited by Steve Abrams

Merge request reports