Maven Dependency proxy: cache miss path

David Fernandez requested to merge 410719-dependency-proxy-cache-miss into master

🔭 Context

We're working on the very first version of the dependency proxy for packages. See #407460 (comment 1373731852) for all the details from the technical investigation.

At the core, the concept is quite simple. GitLab will act as a proxy: users pull packages through it and GitLab becomes a pull-through cache.

Package Manager clients (npm, maven, ...) <-> GitLab <-> External Package Registry

Because GitLab sits in the middle of the package transport (as a proxy), we can use the GitLab package registry as a cache. In other words, before contacting the external package registry, we can check whether the package is already in the local project registry. If that's the case, we can return it directly.

The first iteration will only cover the Maven package format.

In Add the API endpoint for the Maven dependency p... (!123491 - merged), we added the base class for the API endpoints. In Maven Dependency proxy: cache hit path (!129495 - merged), we added the logic of what happens when the requested file is present in the package registry.

This MR aims to fill in part of the core logic: respond accordingly when the requested file is not present in the package registry. The related issue is The Maven dependency proxy API: cache miss path (#410719 - closed).

🎯 The cache miss path

At its core, the idea is very simple. We are in the cache miss path, meaning that the requested file is not in the package registry. In this case, we're going to pull the file from the remote registry and upload it to the local package registry at the same time. Easy, right?

Well, the main challenge is that not all users can create files in the package registry: you need to be at least a developer for that. However, a reporter can also access the dependency proxy. So, how do we solve this? We dissociate the dependency proxy access from the package registry creation and:

  1. If a user can access the dependency proxy but not create a file in the package registry, we will simply proxy the file from the remote registry.
  2. If a user can access the dependency proxy and create a file in the package registry, we will pull the file from the remote registry, return it to the client and upload it to the package registry at the same time.

For the observant readers: in (2.) we have two completely different operations happening at the same time. How is that possible? Read on.

🐎 Workhorse Send Dependency

The rails backend can send instructions to workhorse regarding external urls. The most obvious one is send_url. Basically, workhorse will fetch the file located at the url and send it back to the client. That's actually what we use for (1.) above. You didn't think that we would download the file from the remote registry to send it to the client, all of this on the rails side, right? 😸

We also have send_dependency. It's a similar command but workhorse will use a tee while reading the file located at the url. With that tee, it will "pipe" the output into:

  • the client (similar to what send_url does).
  • a (direct) upload endpoint.

That's how we manage to perform two operations at the same time.
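
To make the tee idea concrete, here is a minimal Ruby sketch. It is not the workhorse implementation (workhorse is a separate component with its own code); tee_copy is a hypothetical helper and the StringIO objects merely stand in for the remote response, the client response and the upload request body.

```ruby
require 'stringio'

# Hypothetical helper: copy `source` to every sink chunk by chunk, so the
# body is streamed to all destinations without being fully buffered first.
def tee_copy(source, sinks, chunk_size: 16 * 1024)
  total = 0
  # IO#read(length) returns nil once the end of the stream is reached.
  while (chunk = source.read(chunk_size))
    sinks.each { |sink| sink.write(chunk) }
    total += chunk.bytesize
  end
  total
end

remote_file = StringIO.new('<project>...</project>') # stand-in for the remote registry response
client      = StringIO.new                            # stand-in for the response sent to the client
upload      = StringIO.new                            # stand-in for the upload request body

bytes_copied = tee_copy(remote_file, [client, upload])
```

After the copy, both sinks hold the same bytes even though the remote file was read only once.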

Now, regarding the upload part, send_dependency has fixed settings and we need to make them configurable. We centralized the options in a structure called UploadConfig that has:

  • method. By default, workhorse will use POST, but the Maven package registry needs a PUT.
  • url. By default, workhorse will simply take the original request url and append an /upload to it. The problem here is that the dependency proxy for Maven packages and the Maven package registry endpoints are mounted in different namespaces in the url. As such, we will use this field to target the Maven package registry upload endpoint.
  • headers. By default, workhorse will re-use all headers from the original request. We need to be able to set custom headers here to solve an authentication issue.
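
The defaults and overrides listed above can be sketched as follows. This is only an illustration under assumptions: effective_upload is a hypothetical helper, the hash keys are not the real payload keys exchanged between rails and workhorse, and the URLs and token are placeholders.

```ruby
# Hypothetical helper: resolve the upload settings for one request.
# Any option missing from `upload_config` falls back to the default
# behavior described above.
def effective_upload(request_url:, request_headers:, upload_config: {})
  {
    method: upload_config.fetch(:method, 'POST'),             # default: POST
    url: upload_config.fetch(:url, "#{request_url}/upload"),  # default: original url + /upload
    headers: upload_config.fetch(:headers, request_headers)   # default: re-use the original headers
  }
end

# Placeholder values, for illustration only.
request_url     = 'http://gdk.test:8000/api/v4/projects/1/dependency_proxy/packages/maven/a/b/1.0/b-1.0.pom'
request_headers = { 'Authorization' => 'Basic <credentials>' }

defaults = effective_upload(request_url: request_url, request_headers: request_headers)

maven = effective_upload(
  request_url: request_url,
  request_headers: request_headers,
  upload_config: {
    method: 'PUT',
    url: 'http://gdk.test:8000/api/v4/projects/1/packages/maven/a/b/1.0/b-1.0.pom',
    headers: { 'Private-Token' => '<token>' }
  }
)
```

Keeping the three options in one structure means a caller that sets nothing gets exactly the previous send_dependency behavior.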

🛃 Authentication

We have a bit of a challenge regarding authentication. To put it plainly, the dependency proxy accepts more credential transports than the upload endpoint of the Maven package registry. Here is the breakdown:

  1. Dependency proxy accepts custom http headers (Private-Token and friends) and basic auth.
  2. The upload endpoint accepts custom http headers only.

As we said above, send_dependency by default will re-use the headers of the original request. Meaning that if users use custom http headers, the upload endpoint will receive those as usual. Great 🎉

If users use basic auth, the upload endpoint will receive these too but not accept them 💥

Thus, in the dependency proxy, we need to detect the basic auth usage and translate those credentials to the custom http headers. Hence, we need to have control over which headers the upload endpoint receives.
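
A minimal sketch of that translation is shown below. It rests on assumptions: upload_headers is a hypothetical helper, the basic auth password is assumed to be a token, and the token is always mapped to Private-Token. The real code may choose a different custom header depending on the kind of token.

```ruby
# Hypothetical helper: build the headers for the upload request.
# If the original request used basic auth, move the password (assumed to be
# a token) into a custom header. Otherwise return the headers unchanged.
def upload_headers(request_headers, token_header: 'Private-Token')
  scheme, encoded = request_headers['Authorization'].to_s.split(' ', 2)
  return request_headers unless scheme&.casecmp?('basic') && encoded

  # 'm' is the base64 directive of String#unpack1.
  _username, password = encoded.unpack1('m').split(':', 2)
  return request_headers if password.nil? || password.empty?

  request_headers
    .reject { |name, _| name == 'Authorization' }
    .merge(token_header => password)
end

# Example credentials are made up for illustration.
basic = ['reporter:example-token'].pack('m0')
translated = upload_headers({ 'Authorization' => "Basic #{basic}", 'Accept' => '*/*' })
```

Requests that already carry a custom header pass through untouched, which matches the "Great 🎉" case above.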

Logic flow

This can help to understand the dependency proxy implementation.

Last piece you need here is that when the file exists locally, we have a behavior that is similar to the dependency proxy of container images: before sending the local file, we run a sanity check. We get the Etag from the remote file and compare it to the digests we have for the local file. If they don't match, then we know that the local file should be discarded and we should pull and upload the remote file again.
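
As an illustration of that sanity check, here is a small sketch. etag_matches? is a hypothetical helper, and it assumes the remote registry uses a digest of the file (MD5 or SHA1 here) as its ETag value.

```ruby
# Hypothetical helper: does the remote ETag match one of the local digests?
def etag_matches?(etag, digests)
  return false if etag.nil?

  # Drop a weak validator prefix (W/) and the surrounding double quotes.
  value = etag.sub(%r{\AW/}, '').delete('"')
  return false if value.empty?

  digests.compact.include?(value)
end

md5  = '96b36764ee82a136ce6e6168b89a7c3c'
sha1 = '3aa2475b534886fb0da594dcbe7cca1b9e455f28'

fresh = etag_matches?(%("#{md5}"), [md5, sha1])      # local file can be served
stale = etag_matches?(%("#{md5}"), ['1234567890'])   # local file should be discarded
```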

Here are the main logic branches:

As you can see, most of the logic was implemented in Maven Dependency proxy: cache hit path (!129495 - merged). This MR completes the implementation.
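
The branches described in this MR can also be summarized in a few lines of Ruby. This is a hedged sketch, not the actual endpoint code: proxy_steps and its step names are hypothetical, and each step stands for the behavior described in the text.

```ruby
# Hypothetical summary of the dependency proxy branches.
# Returns the ordered list of steps taken for one request.
def proxy_steps(file_cached:, etag_matches:, can_create_package:)
  # Cache hit: the local file passed the ETag sanity check.
  return [:send_local_file] if file_cached && etag_matches

  # Cache miss or stale file, but the user can't write to the package registry.
  return [:send_url] unless can_create_package

  # Cache miss or stale file, and the user can write: pull and upload.
  steps = [:send_dependency]
  steps.unshift(:mark_local_file_pending_destruction) if file_cached
  steps
end

# Print the outcome for every combination of conditions.
[true, false].product([true, false], [true, false]).each do |cached, etag_ok, can_create|
  steps = proxy_steps(file_cached: cached, etag_matches: etag_ok, can_create_package: can_create)
  puts "cached=#{cached} etag_ok=#{etag_ok} can_create=#{can_create} -> #{steps.join(', ')}"
end
```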

🔍 What does this MR do and why?

  • Update the send_dependency logic in both rails and workhorse so that we can control aspects of the upload part: the url, the method and the headers.
  • Update the dependency proxy to properly serve and upload a remote file.
  • Update the related specs.
  • Add complete feature specs for the dependency proxy. These are useful as they will run a rails and a workhorse backend. It allows us to verify that all changes (rails and workhorse) are working properly together.

The entire dependency proxy implementation is gated behind a feature flag. Rollout issue: [Feature flag] Enable packages_dependency_proxy... (#415218 - closed).

🖼 Screenshots or screen recordings

Well you know the dependency proxy is a single API endpoint: to pull a specific file. So, no UI. 😸

How to set up and validate locally

You need a project and PATs for at least a maintainer and a reporter.

Let's set up the dependency proxy settings first. There is no UI yet, so we're going to do it using a Rails console.

For the target remote registry, we're going to use this private project on gitlab.com: https://gitlab.com/issue-reproduce/packages/maven/maven-private-project. You will need a username + PAT to access that project. You can also use a public gitlab.com project or even maven central itself.

In a rails console:

Project.find(<local_project_id>).create_dependency_proxy_packages_setting!(maven_external_registry_url: 'https://gitlab.com/api/v4/projects/22780791/packages/maven', maven_external_registry_username: '<username for gitlab.com>', maven_external_registry_password: '<pat for gitlab.com>', enabled: true)

While at it, let's enable the feature flag:

Feature.enable(:packages_dependency_proxy_maven)

We're ready to play around with the dependency proxy. We're not going to use a real maven client (such as mvn or gradle) but use curl to quickly check the behavior of the endpoint.

1️⃣ Package doesn't exist on the remote registry

With a maintainer:

$ curl -vvv "http://<local maintainer username>:<local maintainer pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/i/dont/exist/nop.pom" 

(truncated logs)
< HTTP/1.1 404 Not Found

The send_dependency operation tried to fetch the remote file but ended up with a 404. The entire workhorse logic stops with that status code.

With a reporter:

$ curl -vvv "http://<local reporter username>:<local reporter pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/i/dont/exist/nop.pom"

(truncated logs)
< HTTP/1.1 404 Not Found

Here, we use send_url, which returns the response from the remote registry. The file doesn't exist, so we logically reply with 404 Not Found.

2️⃣ Package exists on the remote registry but not on the local one

With a maintainer:

$ curl "http://<local maintainer username>:<local maintainer pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
<?xml version="1.0" encoding="UTF-8"?>


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
(truncated output)

We get the file contents. Good. Check in a rails console:

Project.find(<project_id>).packages.maven
=> [#<Packages::Package:0x0000000126037d58 id: 295, project_id: 223, created_at: Thu, 31 Aug 2023 19:09:42.666705000 UTC +00:00, updated_at: Thu, 31 Aug 2023 19:09:42.666705000 UTC +00:00, name: "gl/pru/My.Ananas", version: "13.0.3", package_type: "maven", creator_id: 1, status: "default", last_downloaded_at: nil, status_message: nil>]

A package has been created in the package registry. This is the proof that send_dependency has been properly used.

Let's revert to the original state with Project.find(<project_id>).packages.destroy_all

With a reporter:

$ curl "http://<local reporter username>:<local reporter pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
<?xml version="1.0" encoding="UTF-8"?>


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
(truncated output)

The file contents are returned.

Let's check the rails console:

Project.find(<project_id>).packages.maven
=> []

No new packages have been created. This is because the reporter can't create them so we used send_url to send the contents of the file from the remote registry.

3️⃣ File is on both registries but the etag doesn't match

For this part, simply pull the file with the maintainer user. That action will publish the file to the package registry. You can verify it with a rails console. While at it, update the digests of the file so that it will not match the Etag:

Project.find(<project_id>).packages.maven
=> [#<Packages::Package:0x00000001275fe768 id: 296, project_id: 223, created_at: Thu, 31 Aug 2023 19:14:12.794491000 UTC +00:00, updated_at: Thu, 31 Aug 2023 19:14:12.794491000 UTC +00:00, name: "gl/pru/My.Ananas", version: "13.0.3", package_type: "maven", creator_id: 1, status: "default", last_downloaded_at: nil, status_message: nil>]

Project.find(<project_id>).packages.maven.first.package_files.first.update!(file_md5: "1234567890")

Let's see what happens with a maintainer:

$ curl "http://<local maintainer username>:<local maintainer pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
<?xml version="1.0" encoding="UTF-8"?>


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
(truncated output)

Check in the rails console:

Project.find(<project_id>).packages.maven.first.package_files
=> [#<Packages::PackageFile:0x00000001304b5538
  id: 511,
  package_id: 296,
  created_at: Thu, 31 Aug 2023 19:19:18.144712000 UTC +00:00,
  updated_at: Thu, 31 Aug 2023 19:19:18.144712000 UTC +00:00,
  size: 1315,
  file_store: 1,
  file_md5: "96b36764ee82a136ce6e6168b89a7c3c",
  file_sha1: "3aa2475b534886fb0da594dcbe7cca1b9e455f28",
  file_name: "My.Ananas-13.0.3.pom",
  file: "My.Ananas-13.0.3.pom",
  file_sha256: nil,
  verification_retry_at: nil,
  verified_at: nil,
  verification_failure: nil,
  verification_retry_count: nil,
  verification_checksum: nil,
  verification_state: 0,
  verification_started_at: nil,
  status: "default",
  new_file_path: nil>,
 #<Packages::PackageFile:0x00000001304b53a8
  id: 510,
  package_id: 296,
  created_at: Thu, 31 Aug 2023 19:14:12.821174000 UTC +00:00,
  updated_at: Thu, 31 Aug 2023 19:16:58.028187000 UTC +00:00,
  size: 1315,
  file_store: 1,
  file_md5: "1234567890",
  file_sha1: "3aa2475b534886fb0da594dcbe7cca1b9e455f28",
  file_name: "My.Ananas-13.0.3.pom",
  file: "My.Ananas-13.0.3.pom",
  file_sha256: nil,
  verification_retry_at: nil,
  verified_at: nil,
  verification_failure: nil,
  verification_retry_count: nil,
  verification_checksum: nil,
  verification_state: 0,
  verification_started_at: nil,
  status: "pending_destruction",
  new_file_path: nil>]

Notice how we have two similar files. Check the status column: one is pending destruction and the other is available (status default). This is the expected behavior. Recall the sanity check we do when the requested file exists in the local package registry. When that check fails and the user can create files in the package registry, we discard the existing file (mark it as pending_destruction) and use send_dependency again to pull and upload a new version of the file to the local package registry.

Let's see what happens under the same conditions (package exists locally with the wrong ETag) with a reporter:

$ curl "http://<local reporter username>:<local reporter pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/gl/pru/My.Ananas/13.0.3/My.Ananas-13.0.3.pom"
<?xml version="1.0" encoding="UTF-8"?>


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
(truncated output)

In a rails console:

Project.find(<project_id>).packages.maven.first.package_files
=> [#<Packages::PackageFile:0x0000000130dd4650
  id: 512,
  package_id: 297,
  created_at: Thu, 31 Aug 2023 19:23:05.163041000 UTC +00:00,
  updated_at: Thu, 31 Aug 2023 19:23:13.329932000 UTC +00:00,
  size: 1315,
  file_store: 1,
  file_md5: "1234567890",
  file_sha1: "3aa2475b534886fb0da594dcbe7cca1b9e455f28",
  file_name: "My.Ananas-13.0.3.pom",
  file: "My.Ananas-13.0.3.pom",
  file_sha256: nil,
  verification_retry_at: nil,
  verified_at: nil,
  verification_failure: nil,
  verification_retry_count: nil,
  verification_checksum: nil,
  verification_state: 0,
  verification_started_at: nil,
  status: "default",
  new_file_path: nil>]

What happened here? So the sanity check failed but because the user can't write to the package registry, we used send_url to send the remote file back and didn't touch the local file at all.

🔮 Conclusions

  • The dependency proxy behaved properly in all situations.
  • Notice that no matter the conditions, if the user has permission to pull files through the dependency proxy, the requested file will always be returned, either from the remote registry or the local one. Which one depends on the conditions (e.g. the ETag check) and on the permissions of the user on the Maven package registry.
  • Notice that all our tests above used basic auth for authentication. This kind of authentication is not supported by the upload endpoint. Yet, everything went smoothly because the dependency proxy transparently translated the credentials into the custom http headers.

🏁 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez
