Support multiple upstreams in virtual registries file requests
🪜 Context
The maven virtual registry is an upcoming feature that will place the GitLab instance as a proxy for package manager clients, in this case maven clients.
In other words, maven clients will not reach upstream registries directly. Instead, they will reach the GitLab instance, where a virtual registry is configured with an upstream object pointing to the upstream registry URL. When a file is requested, the GitLab instance will get it from the upstream and send it back to the client. While doing so, we will also cache the file on object storage. This way, any further request for the exact same file will be served entirely from the GitLab virtual registry without even pinging the upstream registry.
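The proxy-and-cache behavior described above can be illustrated with a minimal sketch. The `CachingProxy` class, its in-memory hash, and the `fetcher` callable are all hypothetical stand-ins for illustration; the actual feature caches files on object storage and is not implemented this way.

```ruby
# Hypothetical in-memory stand-in for the virtual registry cache.
class CachingProxy
  attr_reader :upstream_hits

  # fetcher: any callable that takes a file path and returns the file content
  # (it stands in for the request sent to the upstream registry).
  def initialize(fetcher)
    @fetcher = fetcher
    @cache = {}
    @upstream_hits = 0
  end

  # Serves the file from the cache when present; otherwise asks the upstream
  # once and stores the result for later requests.
  def get(path)
    @cache.fetch(path) do
      @upstream_hits += 1
      @cache[path] = @fetcher.call(path)
    end
  end
end
```

With this sketch, requesting the same path twice reaches the upstream only once; the second request is answered from the cache.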
Up to now, for simplicity, we worked with the constraint that 1 virtual registry has 1 upstream. The first version of the implementation is now at a point where we can lift that constraint, and we will aim for 1 virtual registry having up to 20 upstreams.
This is issue Maven virtual registry: implement multiple upst... (#525112 - closed) • David Fernandez, Moaz Khalifa • 18.0 • On track.
Now the challenge here is that the virtual registry will receive a request like "I want file /foo/bar/my_package/pkg.pom" and we need to query all upstreams to know which one holds that file. On top of that, the set of upstreams is an ordered list. The order has a meaning here: it dictates which upstreams to ping first. Finally, all of this happens while a client waits for a file, so execution time matters.
Note that we don't need to actually download the file: that part is handled later in the logic (using the workhorse send_dependency logic). That is why we use HEAD requests, which are more lightweight. However, the complexity here is that we have to go from sending a HEAD request to a single upstream to sending HEAD requests to multiple upstreams (20 max). How can we do this efficiently?
Since this part is a bit challenging, we investigated the possible implementations in #516087 (comment 2351858258). Our conclusion is that we can leverage the typhoeus gem in hydra mode to parallelize requests (this part is backed by libcurl multi).
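To make the idea concrete, here is a minimal sketch of "probe all upstreams in parallel, then pick the first one in list order that has the file". It is not the implementation from this MR: it swaps typhoeus/hydra for plain Ruby threads and `Net::HTTP` from the standard library so that it runs without extra gems. The names `HEAD_PROBE` and `first_upstream_with_file`, as well as the timeout values, are assumptions made for illustration.

```ruby
require "net/http"
require "uri"

# Default probe (illustrative): sends a HEAD request and returns true on a 2xx
# response. The timeouts are arbitrary example values.
HEAD_PROBE = lambda do |url|
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                  open_timeout: 2, read_timeout: 5) do |http|
    http.head(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false # an unreachable upstream is treated as "file not found"
end

# Probes every upstream concurrently (one thread per upstream), waits for all
# answers, then returns the first upstream in list order that has the file,
# or nil when no upstream has it.
def first_upstream_with_file(upstream_urls, path, probe: HEAD_PROBE)
  threads = upstream_urls.map do |base_url|
    Thread.new { probe.call("#{base_url.chomp('/')}/#{path.delete_prefix('/')}") }
  end
  results = threads.map(&:value) # same order as upstream_urls
  index = results.index { |found| found }
  index && upstream_urls[index]
end
```

Because the results are read back in list order, an earlier upstream wins even if a later one answers faster. One trade-off of this simple version is that it waits for every probe to finish, so a single slow upstream delays the answer up to its timeout.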
In a previous MR, we updated the models to switch from registry 1:1 upstream to registry 1:n upstreams.
This MR is the follow-up that updates the file request handling to properly query all upstreams attached to a registry.
Lastly, note that the maven virtual registry is still behind a wip feature flag.
🤔 What does this MR do and why?
- Introduce a new service `ee/app/services/virtual_registries/check_upstreams_service.rb` that will check all the upstreams of a registry given a file path. If an upstream has that file, it will be returned.
- Update `ee/app/services/virtual_registries/packages/maven/handle_file_request_service.rb` to use the new service.
- Update/Create all the related specs.
- Update the feature spec to add a case with multiple upstreams.
The feature is behind a feature flag, thus we don't have a changelog here.
📚 References
- Maven virtual registry MVC (API only interactions) (&14137 - closed) • David Fernandez, Moaz Khalifa • 18.1
- Maven virtual registry: implement multiple upst... (#525112 - closed) • David Fernandez, Moaz Khalifa • 18.0 • On track
🖥️ Screenshots or screen recordings
None
🧑‍🔬 How to set up and validate locally
- Enable the feature flag: `Feature.enable(:virtual_registry_maven)`
- Have a top level group, a PAT (scope `api`) and a GitLab instance with an EE license ready.
Let's create a registry (and note the id):
$ curl -X POST --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/groups/<group_id>/-/virtual_registries/packages/maven/registries" | jq
We're going to create 4 upstreams that will follow this order:
- https://gitlab.com/issue-reproduce/packages/maven/parent-group/subgroup1/project1/-/packages
- https://gitlab.com/issue-reproduce/packages/maven/parent-group/subgroup2/project2/-/packages
- https://gitlab.com/issue-reproduce/packages/maven/maven-package/-/packages
- Maven central (the official public registry)
The 4 upstreams are public, thus they don't need any credentials to be accessed.
Let's create the upstreams:
$ curl -X POST --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/registries/<registry id>/upstreams?url=https://gitlab.com/api/v4/projects/31975845/packages/maven" | jq
$ curl -X POST --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/registries/<registry id>/upstreams?url=https://gitlab.com/api/v4/projects/31975827/packages/maven" | jq
$ curl -X POST --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/registries/<registry id>/upstreams?url=https://gitlab.com/api/v4/projects/17012483/packages/maven" | jq
$ curl -X POST --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/registries/<registry id>/upstreams?url=https://repo1.maven.org/maven2" | jq
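The four creation calls above can also be scripted. The sketch below uses only the Ruby standard library and the same endpoint and upstream URLs as the curl commands; the helper names `upstream_creation_uri` and `create_upstreams` are hypothetical, and the response body is not parsed because its shape is not described here. Requests are sent one after the other so that the upstreams are created in the intended order.

```ruby
require "net/http"
require "uri"

# Upstream URLs, in the order they should be attached to the registry.
UPSTREAM_URLS = [
  "https://gitlab.com/api/v4/projects/31975845/packages/maven",
  "https://gitlab.com/api/v4/projects/31975827/packages/maven",
  "https://gitlab.com/api/v4/projects/17012483/packages/maven",
  "https://repo1.maven.org/maven2"
].freeze

# Builds the creation URI, URL-encoding the upstream url query parameter.
def upstream_creation_uri(registry_id, upstream_url, host: "http://gdk.test:8000")
  uri = URI.parse("#{host}/api/v4/virtual_registries/packages/maven/registries/#{registry_id}/upstreams")
  uri.query = URI.encode_www_form(url: upstream_url)
  uri
end

# Sends one POST per upstream, sequentially, and prints the HTTP status.
def create_upstreams(registry_id, token)
  UPSTREAM_URLS.each do |upstream_url|
    uri = upstream_creation_uri(registry_id, upstream_url)
    request = Net::HTTP::Post.new(uri)
    request["Private-Token"] = token
    response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
    puts "#{upstream_url} -> HTTP #{response.code}"
  end
end

# Usage (needs a running GDK): create_upstreams(<registry id>, "<PAT>")
```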
Let's check the upstreams on the registry. You should see the 4 upstreams in the exact same order that we created them.
$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/registries/<registry id>/upstreams" | jq
All good. Now, we're going to build a small maven application that will ask for packages that only exist in one of the upstreams. We're going to point to the virtual registry and ask mvn to pull the dependencies.
- Have a working `mvn` installation.
In a folder, add these files
`pom.xml`
<?xml version="1.0" encoding="UTF-8" ?>
<project
xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd"
>
<modelVersion>4.0.0</modelVersion>
<groupId>org.sandbox</groupId>
<artifactId>test</artifactId>
<version>3.4.0</version>
<name>test</name>
<properties>
<java.version>17</java.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
<dependencies>
<dependency>
<groupId>gl.pru</groupId>
<artifactId>my-bananas</artifactId>
<version>1.2.3</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-client</artifactId>
<version>12.0.19</version>
</dependency>
<dependency>
<groupId>my.group.id</groupId>
<artifactId>Asoka</artifactId>
<version>9.7.9</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents.client5</groupId>
<artifactId>httpclient5</artifactId>
<version>5.4.3</version>
</dependency>
</dependencies>
</project>
`settings.xml`
<settings>
<mirrors>
<mirror>
<id>gitlab-maven</id>
<name>GitLab proxy of central repo</name>
<url>http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry_id></url>
<mirrorOf>central</mirrorOf>
</mirror>
</mirrors>
<servers>
<server>
<id>gitlab-maven</id>
<configuration>
<httpHeaders>
<property>
<name>Private-Token</name>
<value><PAT></value>
</property>
</httpHeaders>
</configuration>
</server>
</servers>
</settings>
Note that the jetty dependency is present in both the second upstream and maven central. Because the second upstream comes before maven central in the list, that dependency should be pulled from the second upstream.
Run rm -rf ~/.m2 to make sure that you don't pull dependencies from the local cache.
Let's pull the dependencies with:
$ mvn compile -s settings.xml
Let's examine the virtual registry and the upstreams in a rails console:
r = VirtualRegistries::Packages::Maven::Registry.find(<registry_id>)
# let's check the first upstream, it should have cached the file requests for a single dependency (in maven's world, a dependency is at least 2 files)
r.upstreams[0].cache_entries.count
=> 2 # good
# let's check the second upstream, same thing
r.upstreams[1].cache_entries.count
=> 2 # good
# The second upstream should have the jetty client in its cache, even though this dependency exists in maven central (last upstream)
r.upstreams[1].cache_entries.first.relative_path
=> "/org/eclipse/jetty/jetty-client/12.0.19/jetty-client-12.0.19.jar" # good
# let's check the third upstream, same thing
r.upstreams[2].cache_entries.count
=> 2 # good
# lastly, maven central is the last upstream that should be used for all dependencies not found in the first 3 upstreams
r.upstreams[3].cache_entries.count
=> 113 # good
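The per-upstream checks above can be condensed into one overview. This is a small convenience sketch: `cache_summary` is a hypothetical helper, and it assumes that each upstream responds to `url` in addition to the `cache_entries` association used above.

```ruby
# Returns one row per upstream, in list order, with the number of cached files.
# Works with any objects that respond to `url` (assumed) and `cache_entries`.
def cache_summary(upstreams)
  upstreams.each_with_index.map do |upstream, position|
    { position: position, url: upstream.url, cached_files: upstream.cache_entries.count }
  end
end
```

In the rails console, `cache_summary(r.upstreams).each { |row| puts row }` would print the rows for the registry loaded earlier.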
Given the above, the handling of multiple upstreams behaves as expected.
🏎️ MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.