Maven Dependency proxy: cache hit path

David Fernandez requested to merge 410717-dependency-proxy-cache-hit into master

🔭 Context

We're working on the very first version of the dependency proxy for packages. See #407460 (comment 1373731852) for all the details from the technical investigation.

At its core, the concept is quite simple: GitLab acts as a proxy. Users pull packages through it, and GitLab becomes a pull-through cache.

Package Manager clients (npm, maven, ...) <-> GitLab <-> External Package Registry

Because GitLab sits in the middle of the package transport (aka the proxy), we can use the GitLab Package registry as a cache. In other words, before contacting the external package registry, we can check whether the package is already in the local project registry. If it is, we can return it directly.

The first iteration will only cover the Maven package format.

In Add the API endpoint for the Maven dependency p... (!123491 - merged), we added the base class for the API endpoints.

This MR aims to fill in part of the core logic: responding accordingly when the requested file is present in the package registry.

🎯 The cache hit path

The cache hit path is the code path executed when the package registry has the requested file. The other path is the "cache miss" path, which is not part of this MR. This MR is already quite chunky, and this is a reasonable work split because the cache miss path will need some updates to workhorse (see The Maven dependency proxy API: cache miss path (#410719 - closed)).

This path seems easy, right? Locate the file in the package registry and if found, return it. Done.

Well, things are a bit more complex because we want to mimic a sub-feature of the dependency proxy for container repositories: a cache that automatically checks itself and discards incoherent entries. The idea is quite simple: before returning the file that we have in GitLab, check with the remote registry that it is the same file. For this, we're simply going to use the etag header.

In other words, before returning the file, we will run a sanity check and HEAD the remote registry. If all is good, the file is returned from the GitLab package registry.

Obviously, if the sanity check fails, then we will remove the package from GitLab and simply re-download the package from the remote registry.
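The sanity check described above can be sketched as a comparison between the remote etag and the digests stored for the local file. This is only an illustration: the helper name etag_coherent? is hypothetical, and the sketch assumes that the remote registry exposes an MD5 or SHA1 digest of the file as its etag value, which is not guaranteed for every registry.

```ruby
require 'digest'

# Hypothetical helper (not taken from this MR): returns true when the remote
# ETag value matches one of the digests stored for the local package file.
# Assumption: the remote registry uses a file digest as its ETag value.
def etag_coherent?(etag, local_digests)
  return false if etag.nil? || etag.strip.empty?

  # Drop the weak-validator prefix (W/) and the surrounding double quotes.
  value = etag.strip.sub(/\AW\//, '').delete('"').downcase
  local_digests.compact.map(&:downcase).include?(value)
end

# Example data standing in for a stored package file and its digests.
content = '<project>example</project>'
digests = [Digest::MD5.hexdigest(content), Digest::SHA1.hexdigest(content)]

puts etag_coherent?(%("#{digests.first}"), digests) # etag equals the stored MD5
puts etag_coherent?('"12345"', digests)             # etag that matches no digest
```

In a real request, the etag would come from the response to the HEAD request sent to the remote registry. A missing etag is treated here as a failed check, which is a design choice of this sketch rather than a statement about the actual implementation.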

🛃 Permissions

There are two aspects that are handled when we receive a request for a package on the dependency proxy:

  1. Returning the requested package (whether from GitLab or the remote registry).
  2. Managing the cache.

(1.) is a read-only operation and, as such, follows the read rules of the package registry, with one exception: we don't allow anonymous requests. We require users to be authenticated. The minimum role is guest, except for private projects where it is reporter. This is in line with the existing package registry permissions.
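As a rough illustration of the read rule above, here is a minimal sketch. The method name can_read_through_proxy? and the shape of its arguments are hypothetical. The numeric values mirror GitLab's standard access levels, and the real check goes through the package registry policies rather than a lookup table like this one.

```ruby
# GitLab's standard numeric access levels, used here only to compare roles.
ACCESS_LEVELS = { guest: 10, reporter: 20, developer: 30, maintainer: 40, owner: 50 }.freeze

# Hypothetical helper: role is nil for an anonymous (unauthenticated) request.
def can_read_through_proxy?(role, private_project:)
  return false if role.nil? # anonymous requests are never allowed

  minimum = private_project ? :reporter : :guest
  ACCESS_LEVELS.fetch(role) >= ACCESS_LEVELS.fetch(minimum)
end

puts can_read_through_proxy?(:guest, private_project: false)   # guest on a non-private project
puts can_read_through_proxy?(:guest, private_project: true)    # guest on a private project
puts can_read_through_proxy?(nil, private_project: false)      # anonymous request
```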

Now, (2.) is more complex as handling the cache can involve writing (to create a new entry or to discard an existing one). Since our cache is the package registry, we're going to strictly follow the permissions that we have there.

This leaves us with the question: how do we handle users that can do (1.) but not (2.)? That is an excellent question. Well, we're going to check what the user can do with the package registry and the fallback solution will be to return the file from the remote registry.

This seems complex, but the next section lays out the logic flow to help with understanding it.

Logic flow

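  • Is the file in the package registry?
    • (found). Is the remote etag coherent with the digests we have?
      • (coherent etag). Return the file from the package registry.
      • (wrong digest). Can the user delete and write to the package registry?
        • (yes). Mark the file for destruction, then pull it from the remote registry and publish it locally. The pull+publish part is out of scope for this MR, so we return 202 Accepted.
        • (no). Return the file from the remote registry.
    • (not found). Can the user write to the package registry?
      • (yes). Pull the file from the remote registry and publish it locally. This part is out of scope for this MR, so we return 202 Accepted.
      • (no). Return the file from the remote registry.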
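The flow above can be sketched as a small decision function. The outcomes for each branch are taken from the validation scenarios later in this description. The method name and the returned symbols are hypothetical and only illustrate the branching, not the actual service code.

```ruby
# Hypothetical decision helper for the dependency proxy flow.
# All arguments are booleans computed elsewhere (lookup, etag check, policies).
def proxy_action(file_found:, etag_coherent:, can_write:, can_destroy:)
  if file_found
    # Cache hit with a coherent etag: a pure read from the package registry.
    return :serve_from_package_registry if etag_coherent

    # Incoherent entry: only users who can delete and write may refresh it.
    return :destroy_then_pull_and_publish if can_write && can_destroy

    # Otherwise the local file is considered discarded for this request.
    return :send_from_remote_registry
  end

  # Cache miss: populate the cache if allowed, else proxy the remote file.
  can_write ? :pull_and_publish : :send_from_remote_registry
end

puts proxy_action(file_found: true, etag_coherent: false, can_write: false, can_destroy: false)
```

In this MR, both pull-and-publish outcomes are answered with 202 Accepted, since that part belongs to the cache miss work.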

Note that to serve the file from the remote registry, we're going to leverage a workhorse feature: #send_url. This way, we don't even need to download the file on the rails side. Workhorse will do the heavy lifting for us.

🔬 What does this MR do and why?

  • Implement the dependency proxy logic when a package file is found in the package registry.
  • Implement an etag sanity check to verify that the stored package file is coherent with the remote one.
  • Update permissions to not allow anonymous requests.
  • Add and update all the related specs.

The entire dependency proxy implementation is gated behind a feature flag. Rollout issue: [Feature flag] Enable packages_dependency_proxy... (#415218 - closed).

💅 Screenshots or screen recordings

From a client's perspective, the dependency proxy is a single API endpoint 😸 so no UI.

How to set up and validate locally

You need a project and PATs for at least a maintainer and a reporter.

Let's set up the dependency proxy settings first. There is no UI yet, so we're going to do it using a Rails console.

For the target remote registry, we're going to use this private project on gitlab.com: https://gitlab.com/issue-reproduce/packages/maven/maven-private-project. You will need a username + PAT to access that project. You can also use a public gitlab.com project or even maven central itself.

In a rails console:

Project.find(<local_project_id>).create_dependency_proxy_packages_setting!(
  maven_external_registry_url: 'https://gitlab.com/api/v4/projects/22780791/packages/maven',
  maven_external_registry_username: '<username for gitlab.com>',
  maven_external_registry_password: '<pat for gitlab.com>',
  enabled: true
)

While at it, let's enable the feature flag:

Feature.enable(:packages_dependency_proxy_maven)

We're ready to play around with the dependency proxy. We're not going to use a real maven client (such as mvn or gradle) but use curl to quickly check the behavior of the endpoint.

1️⃣ Package doesn't exist on the remote registry

With a maintainer:

$ curl "http://<local maintainer username>:<local maintainer pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/i/dont/exist/nop.pom" 
{"message":"202 Accepted"}
  • 202 Accepted is the default response returned when we hit the situation where the backend needs to pull from the remote registry and publish to the local package registry. This part is out of the scope of this MR.

Let's try it again with a reporter:

$ curl "http://<local reporter username>:<local reporter pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/i/dont/exist/nop.pom"
{"message":"404 Not Found"}
  • The reporter doesn't have the right to publish to the local package registry, so we still try to send the file directly from the remote registry. Problem: that file doesn't exist, hence the 404.

2️⃣ Package exists on the remote registry but not on the local one

With a maintainer:

$ curl "http://<local maintainer username>:<local maintainer pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
{"message":"202 Accepted"}
  • We end up in a similar scenario: since the user has the rights to write to the package registry, the dependency proxy will need to pull the file from the remote registry, publish it to the local registry and send it back to the client. This part is out of the scope of this MR.

With a reporter:

$ curl "http://<local reporter username>:<local reporter pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
<?xml version="1.0" encoding="UTF-8"?>


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<snip snip>
  • This user can't write to the local package registry, but we can still try to pull the file from the remote registry and send it back. As we can see, the file is properly returned.

3️⃣ File is on both registries

For this part, we need to upload the exact same files that are on the remote registry. From https://gitlab.com/issue-reproduce/packages/maven/maven-private-project/-/packages/16878679, download the .jar and .pom files.

Create a settings.xml file with:

<settings>
  <servers>
    <server>
      <id>gl</id>
      <configuration>
        <httpHeaders>
          <property>
            <name>Private-Token</name>
            <value><local maintainer pat></value>
          </property>
        </httpHeaders>
      </configuration>
    </server>
  </servers>
</settings>

then upload the files manually:

$ mvn deploy:deploy-file -Durl="http://gdk.test:8000/api/v4/projects/<project_id>/packages/maven" -DrepositoryId="gl" -Dfile="<location of the .jar file>" -DpomFile="<location of the .pom file>" -s settings.xml 

Ok, at this point, we have the exact same files locally and on the remote registry. Let's pull files.

With a maintainer:

$ curl "http://<local maintainer username>:<local maintainer pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"        
<?xml version="1.0" encoding="UTF-8"?>


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<snip snip>

With a reporter:

$ curl "http://<local reporter username>:<local reporter pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
<?xml version="1.0" encoding="UTF-8"?>


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<snip snip>
  • This is the happy path when the requested file is in the local registry and the etag is coherent with the local digests: the file is served from the local registry. It's a "pure" read only operation so permission levels don't matter much as long as the user is reporter+ for private projects or guest+ for the others.

4️⃣ File is on both registries but the etag doesn't match

Open a rails console:

Packages::PackageFile.find_by(file_name: 'My.Ananas-13.0.3.pom').update!(file_md5: '12345', file_sha1: '12345', file_sha256: '12345')

With a reporter:

$ curl "http://<local reporter username>:<local reporter pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
<?xml version="1.0" encoding="UTF-8"?>


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<snip snip>

With a maintainer:

$ curl "http://<local maintainer username>:<local maintainer pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
{"message":"202 Accepted"}
  • When the etag check fails, we need to delete the existing file and pull it again from the remote registry to publish it locally. That happens only if the user has enough permissions, and a reporter can't do that. As such, we consider the existing file in the local registry as "discarded", and the only solution here is to pull the file from the remote registry and return it to the client.
  • With the maintainer, things are different: we can interact with the package registry. We mark the package file for destruction and initiate a pull+publish action. This action is out of the scope of this MR, so the endpoint returns 202 Accepted.

Open a rails console:

Packages::PackageFile.find_by(file_name: 'My.Ananas-13.0.3.pom').status
=> "pending_destruction"

The file is pending destruction.

🏁 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez
