Skip to content

Sort maven metadata results by recent file

Steve Abrams requested to merge 285334-maven-cross-group-sort into master

🏛 Context

Maven (Java) packages are made up of a variety of files. Each package has a maven-metadata.xml file that is unique not to the specific package version, but is used to store metadata for all versions. So if I have package Foo with versions 1.0, 2.0, 2.1, and 3.0, there is a maven-metadata.xml file that describes all versions of the package. So you could say the maven-metadata.xml file is a "versionless" file that belongs to many versions of the same package.

In the GitLab package registry each version is it's own Package record. Every time a user publishes a new Maven package, their local Maven client will update the maven-metadata.xml file, and upload it as part of the package. Because this file has no version attached to it, it is assigned to a versionless package record in the GitLab database. As more package versions are published, more maven-metadata.xml files are uploaded to the versionless package.

When a user installs a Maven package, the Maven client first requests the maven-metadata.xml to understand what versions are available. Currently, the way GitLab finds this file is we find the matching package with no version and return the last uploaded maven-metadata.xml file.

There are occasions where customers have had different versions of the same package uploaded to different projects, sometimes in different subgroups. This means that for each project, a versionless package will exist with the maven-metadata.xml files that were included when those versions were published. The problem is, if a user searches the group for a package (using Maven's group-level endpoint), we may or may not return the most recent maven-metadata.xml file.

🔎 What does this MR do?

Currently, in this situation, projects are sorted by ID (by default), so the last matching versionless package will be used to fetch maven-metadata.xml file. But if it is not the last one to have had a file uploaded to it, it will have an older version of the maven-metadata.xml file.

This MR updates the Maven package finder so when we are dealing with versionless packages, we don't just return the .last package, but we return the package with the most recently created PackageFile.

🐘 Database

A new scope: .order_by_package_file is introduced.

This test query finds a package that is nested several namespaces deep.

Query without the scope:

Explain plan: https://postgres.ai/console/shared/831f855a-7489-4627-8a5c-c7d475d98cf0

SELECT "packages_packages".* 
FROM "packages_packages" 
INNER JOIN "packages_maven_metadata" ON "packages_maven_metadata"."package_id" = "packages_packages"."id" 
INNER JOIN "packages_package_files" ON "packages_package_files"."package_id" = "packages_packages"."id" 
WHERE "packages_packages"."project_id" IN (
  SELECT "projects"."id" 
  FROM "projects" 
  WHERE "projects"."namespace_id" in (
    WITH RECURSIVE "base_and_descendants" AS ((SELECT "namespaces".* FROM "namespaces" WHERE "namespaces"."id" = 785414)
    UNION
    (SELECT "namespaces".* FROM "namespaces", "base_and_descendants" WHERE "namespaces"."parent_id" = "base_and_descendants"."id")) SELECT "id" FROM "base_and_descendants" AS "namespaces"
  ) AND (
    EXISTS (
      SELECT 1 FROM "project_authorizations" 
      WHERE "project_authorizations"."user_id" = 3983112
      AND (project_authorizations.project_id = projects.id)
    ) 
    OR projects.visibility_level IN (0,10,20)
  )
) 
AND "packages_maven_metadata"."path" = 'com/sandbox/bananas'
ORDER BY "packages_packages"."id" DESC 
LIMIT 1;

Time: 17.614 ms
  - planning: 2.582 ms
  - execution: 15.032 ms
    - I/O read: N/A
    - I/O write: N/A

Shared buffers:
  - hits: 5027 (~39.30 MiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

Query with the scope:

Explain plan: https://postgres.ai/console/shared/ab7926e4-d7b9-497e-8236-5bc9a7deed4a

The scope changes final ORDER BY and adds an additional JOIN against packages_package_files

SELECT "packages_packages".* 
FROM "packages_packages" 
INNER JOIN "packages_maven_metadata" ON "packages_maven_metadata"."package_id" = "packages_packages"."id" 
INNER JOIN "packages_package_files" ON "packages_package_files"."package_id" = "packages_packages"."id" 
WHERE "packages_packages"."project_id" IN (
  SELECT "projects"."id" 
  FROM "projects" 
  WHERE "projects"."namespace_id" in (
    WITH RECURSIVE "base_and_descendants" AS ((SELECT "namespaces".* FROM "namespaces" WHERE "namespaces"."id" = 785414)
    UNION
    (SELECT "namespaces".* FROM "namespaces", "base_and_descendants" WHERE "namespaces"."parent_id" = "base_and_descendants"."id")) SELECT "id" FROM "base_and_descendants" AS "namespaces"
  ) AND (
    EXISTS (
      SELECT 1 FROM "project_authorizations" 
      WHERE "project_authorizations"."user_id" = 3983112
      AND (project_authorizations.project_id = projects.id)
    ) 
    OR projects.visibility_level IN (0,10,20)
  )
) 
AND "packages_maven_metadata"."path" = 'com/sandbox/bananas'
ORDER BY packages_package_files.created_at DESC
LIMIT 1;

Time: 25.130 ms
  - planning: 2.939 ms
  - execution: 22.191 ms
    - I/O read: N/A
    - I/O write: N/A

Shared buffers:
  - hits: 5779 (~45.10 MiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0

📸 Screenshots (strongly suggested)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

Related to #285334 (closed)

Edited by Steve Abrams

Merge request reports