Add remote checksums for Maven package registry and dependency proxy

🖐️ Context

Maven clients (such as $ mvn) don't work with a single file when interacting with a registry for a package. Instead, they rely on a set of files (.pom, .jar, and maven-metadata.xml, for example).

For each file, integrity is provided by additional, related files: for example, .pom.md5 and .pom.sha1. Thus, for each file, Maven clients will trigger the following web requests (example with a single .pom file):

  • /pkg.pom
  • /pkg.pom.md5
  • /pkg.pom.sha1

Not all 3 requests happen every time: it depends on the running conditions of the given Maven client. However, it is very common to see at least one of the checksums being requested. Example.

During our work on Maven virtual registries (a feature meant to be used with Maven clients), we stumbled upon this page. In a few words, we can include custom x-* HTTP headers in the response for /pkg.pom, and these headers will "transport" the sha1 and md5 checksums. Maven clients will read these headers and completely skip the requests for /pkg.pom.md5 and /pkg.pom.sha1. This saves backend resources (CPU time and database requests).
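To make the idea concrete, here is a minimal Python sketch of how such headers could be built for a file. The header names x-checksum-sha1 and x-checksum-md5 are assumptions (the description above only says x-*), and checksum_headers is a hypothetical helper, not the code of this MR. A real implementation would presumably reuse checksums already stored with the package file rather than hashing the content on each request.

```python
import hashlib

# Assumed header names: the description only mentions custom "x-*" headers.
SHA1_HEADER = "x-checksum-sha1"
MD5_HEADER = "x-checksum-md5"


def checksum_headers(file_content: bytes) -> dict[str, str]:
    """Build the extra response headers that carry the file checksums."""
    return {
        SHA1_HEADER: hashlib.sha1(file_content).hexdigest(),
        MD5_HEADER: hashlib.md5(file_content).hexdigest(),
    }


if __name__ == "__main__":
    # Headers that would accompany the response for /pkg.pom.
    for name, value in checksum_headers(b"<project/>").items():
        print(f"{name}: {value}")
```

With both values present on the /pkg.pom response, a client that understands these headers has nothing left to fetch from /pkg.pom.md5 or /pkg.pom.sha1.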

⚔️ Designing the solution

The challenge we have here is that the /pkg.pom request is basically resolving to a file on object storage. As such, we have a few different configurations to handle:

  1. Object storage disabled. The file system is used.
  2. Object storage enabled.
    1. Proxy download enabled. The file is pulled from object storage by GitLab (workhorse) and sent back to the client.
    2. Proxy download disabled. GitLab answers with a redirect to a signed URL that points to the file on object storage directly. The client will follow that redirect to get the file.

Now, when it comes to setting custom x-* headers on the response for /pkg.pom, we have:

  1. Object storage disabled. We can do it.
  2. Object storage enabled.
    1. Proxy download enabled. We can do it.
    2. Proxy download disabled. Technically impossible as we are limited in how we can instruct object storage providers to send back specific headers.

Thus, we need to avoid case (2.)(2.). We do so by forcing the proxy download, which puts us in case (2.)(1.).
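The case analysis above can be sketched as a small decision function. This is only an illustration in Python: StorageConfig, can_set_checksum_headers and force_proxy_download are hypothetical names, not the helper changed in this MR.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StorageConfig:
    """Hypothetical view of the package registry storage settings."""
    object_storage_enabled: bool
    proxy_download: bool


def can_set_checksum_headers(config: StorageConfig) -> bool:
    """True when GitLab itself sends the file, so it controls the headers."""
    if not config.object_storage_enabled:
        return True  # case 1: file served from the file system
    # case 2.1: GitLab proxies the file; case 2.2: the client is redirected
    return config.proxy_download


def force_proxy_download(config: StorageConfig) -> StorageConfig:
    """Turn case 2.2 into case 2.1; leave the other cases untouched."""
    if config.object_storage_enabled and not config.proxy_download:
        return replace(config, proxy_download=True)
    return config


if __name__ == "__main__":
    redirecting = StorageConfig(object_storage_enabled=True, proxy_download=False)
    print(can_set_checksum_headers(redirecting))
    print(can_set_checksum_headers(force_proxy_download(redirecting)))
```

After force_proxy_download, every configuration lets GitLab attach the headers, which is the property the design relies on.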

On gitlab.com, the package registry already uses the proxy download, case (2.)(1.). Thus, forcing the proxy download only impacts self-managed users who disabled it for the package registry.

🤔 What does this MR do and why?

  • When requesting a file in the Maven package registry, set the x-* headers to send back the checksums along with the file.
    • The proxy download is forced to be enabled if necessary. In other words, proxy_download: false for the Maven package registry is no longer possible.
  • The helper changed in this MR also impacts the Maven dependency proxy, which has a similar behavior (returning files to Maven clients). There, the object storage configuration of the package registry is used, so we have the exact same situation.
  • Adjust the related specs.

Because the Maven package registry is one of the most used package registries on gitlab.com, this change is gated behind a feature flag to provide an additional safety net during its deployment.

🏁 MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

🦄 Screenshots or screen recordings

No UI changes.

⚙️ How to set up and validate locally

Have:

  • A project.
  • A PAT (api scope) with the Maintainer role on the project.
  • A working $ mvn installation.

1️⃣ Maven package registry

pom.xml
<?xml version="1.0" encoding="UTF-8" ?>
<project
    xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"
>
<modelVersion>4.0.0</modelVersion>
<groupId>test</groupId>
<artifactId>test</artifactId>
<version>1.2.3</version>

<dependencies>
    <dependency>
        <groupId>gl.pru</groupId>
        <artifactId>My.Dependency</artifactId>
        <version>1.3.7</version>
    </dependency>
</dependencies>

<repositories>
    <repository>
        <id>gitlab-maven</id>
        <url>http://gdk.test:8000/api/v4/projects/<project_id>/packages/maven</url>
    </repository>
</repositories>
</project>

settings.xml
<settings
    xmlns="http://maven.apache.org/SETTINGS/1.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                  http://maven.apache.org/xsd/settings-1.0.0.xsd"
>
  <mirrors>
    <mirror>
      <id>maven-default-http-blocker</id>
      <url>http://127.0.0.1/dont-go-here</url>
      <mirrorOf>dummy</mirrorOf>
      <blocked>false</blocked>
    </mirror>
  </mirrors>  
  <servers>
    <server>
      <id>gitlab-maven</id>
      <configuration>
        <httpHeaders>
          <property>
            <name>Private-Token</name>
            <value>***PAT TOKEN HERE***</value>
          </property>
        </httpHeaders>
      </configuration>
    </server>
  </servers>
</settings>

Remove the maven cache for the dependency package:

$ rm -rf ~/.m2/repository/gl/pru

Let's install the dependencies:

$ mvn install -s settings.xml 

On master

With $ tail -f log/development.log | grep "api/v4/projects/<project_id>/packages/maven", we get

Started GET "/api/v4/projects/22/packages/maven/gl/pru/My.Dependency/1.3.7/My.Dependency-1.3.7.pom" for 172.16.123.1 at 2024-12-17 11:28:52 +0100
Started GET "/api/v4/projects/22/packages/maven/gl/pru/My.Dependency/1.3.7/My.Dependency-1.3.7.pom.sha1" for 172.16.123.1 at 2024-12-17 11:28:52 +0100
Started GET "/api/v4/projects/22/packages/maven/gl/pru/My.Dependency/1.3.7/My.Dependency-1.3.7.jar" for 172.16.123.1 at 2024-12-17 11:28:52 +0100
Started GET "/api/v4/projects/22/packages/maven/gl/pru/My.Dependency/1.3.7/My.Dependency-1.3.7.jar.sha1" for 172.16.123.1 at 2024-12-17 11:28:53 +0100

We can see that each file is requested along with its sha1 checksum, for a total of 4 requests.

With this MR

Started GET "/api/v4/projects/22/packages/maven/gl/pru/My.Dependency/1.3.7/My.Dependency-1.3.7.pom" for 172.16.123.1 at 2024-12-17 11:30:42 +0100
Started GET "/api/v4/projects/22/packages/maven/gl/pru/My.Dependency/1.3.7/My.Dependency-1.3.7.jar" for 172.16.123.1 at 2024-12-17 11:30:43 +0100

As you can see, we only have requests for the files themselves. There are no more requests for checksums because they are included in the response along with the file. Only 2 requests (50% reduction) 🎉
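To compare the two runs without counting lines by hand, a small Python sketch like the one below can classify the "Started GET" lines from log/development.log. The function name count_requests and the choice of the .md5 and .sha1 suffixes as the checksum markers are assumptions made for illustration.

```python
import re

# A request is counted as a checksum request when its path ends with one of these.
CHECKSUM_SUFFIXES = (".md5", ".sha1")
STARTED_GET = re.compile(r'Started GET "([^"]+)"')


def count_requests(log_lines):
    """Return (file_requests, checksum_requests) for the given log lines."""
    files = 0
    checksums = 0
    for line in log_lines:
        match = STARTED_GET.search(line)
        if match is None:
            continue  # not a request line
        if match.group(1).endswith(CHECKSUM_SUFFIXES):
            checksums += 1
        else:
            files += 1
    return files, checksums


if __name__ == "__main__":
    sample = [
        'Started GET "/api/v4/projects/22/packages/maven/gl/pru/My.Dependency/1.3.7/My.Dependency-1.3.7.pom" for 172.16.123.1 at 2024-12-17 11:28:52 +0100',
        'Started GET "/api/v4/projects/22/packages/maven/gl/pru/My.Dependency/1.3.7/My.Dependency-1.3.7.pom.sha1" for 172.16.123.1 at 2024-12-17 11:28:52 +0100',
    ]
    print(count_requests(sample))
```

On master, the checksum count should match the file count. With this MR, it should drop to 0.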

2️⃣ Maven dependency proxy

In the project settings,

  • go to Packages and registries

  • enable the dependency proxy

  • set the url to https://repo1.maven.org/maven2/

  • save the changes

  • Create a local folder with these files:

pom.xml
<?xml version="1.0" encoding="UTF-8" ?>
<project
    xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"
>
<modelVersion>4.0.0</modelVersion>
<groupId>test</groupId>
<artifactId>test</artifactId>
<version>1.2.3</version>

<dependencies>
  <dependency>
    <groupId>org.junit.jupiter</groupId>
    <artifactId>junit-jupiter-api</artifactId>
    <version>5.11.4</version>
    <scope>test</scope>
  </dependency>
</dependencies>

<repositories>
  <repository>
    <id>gitlab-maven</id>
    <url>http://gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven</url>
  </repository>
</repositories>
</project>

settings.xml
<settings
    xmlns="http://maven.apache.org/SETTINGS/1.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0
                  http://maven.apache.org/xsd/settings-1.0.0.xsd"
>
  <mirrors>
    <mirror>
      <id>maven-default-http-blocker</id>
      <url>http://127.0.0.1/dont-go-here</url>
      <mirrorOf>dummy</mirrorOf>
      <blocked>false</blocked>
    </mirror>
  </mirrors>  
  <servers>
    <server>
      <id>gitlab-maven</id>
      <configuration>
        <httpHeaders>
          <property>
            <name>Private-Token</name>
            <value>***PAT TOKEN HERE***</value>
          </property>
        </httpHeaders>
      </configuration>
    </server>
  </servers>
</settings>

Remove the maven cache for the junit package:

$ rm -rf ~/.m2/repository/org/junit

Let's install the dependencies:

$ mvn install -s settings.xml 

On master

With $ tail -f log/development.log | grep "api/v4/projects/<project_id>/dependency_proxy/packages/maven", we get

Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/jupiter/junit-jupiter-api/5.11.4/junit-jupiter-api-5.11.4.pom" for 172.16.123.1 at 2024-12-17 11:51:59 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/jupiter/junit-jupiter-api/5.11.4/junit-jupiter-api-5.11.4.pom.sha1" for 172.16.123.1 at 2024-12-17 11:52:00 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/junit-bom/5.11.4/junit-bom-5.11.4.pom" for 172.16.123.1 at 2024-12-17 11:52:00 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/junit-bom/5.11.4/junit-bom-5.11.4.pom.sha1" for 172.16.123.1 at 2024-12-17 11:52:01 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/platform/junit-platform-commons/1.11.4/junit-platform-commons-1.11.4.pom" for 172.16.123.1 at 2024-12-17 11:52:01 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/platform/junit-platform-commons/1.11.4/junit-platform-commons-1.11.4.pom.sha1" for 172.16.123.1 at 2024-12-17 11:52:01 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/jupiter/junit-jupiter-api/5.11.4/junit-jupiter-api-5.11.4.jar" for 172.16.123.1 at 2024-12-17 11:52:01 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/jupiter/junit-jupiter-api/5.11.4/junit-jupiter-api-5.11.4.jar.sha1" for 172.16.123.1 at 2024-12-17 11:52:01 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/platform/junit-platform-commons/1.11.4/junit-platform-commons-1.11.4.jar" for 172.16.123.1 at 2024-12-17 11:52:01 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/platform/junit-platform-commons/1.11.4/junit-platform-commons-1.11.4.jar.sha1" for 172.16.123.1 at 2024-12-17 11:52:02 +0100

We can see that each file is requested along with its sha1 checksum, for a total of 10 requests.

(More than our single dependency was pulled because it has transitive dependencies of its own, which means more files to pull.)

With this MR

Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/jupiter/junit-jupiter-api/5.11.4/junit-jupiter-api-5.11.4.pom" for 172.16.123.1 at 2024-12-17 11:47:58 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/junit-bom/5.11.4/junit-bom-5.11.4.pom" for 172.16.123.1 at 2024-12-17 11:47:59 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/platform/junit-platform-commons/1.11.4/junit-platform-commons-1.11.4.pom" for 172.16.123.1 at 2024-12-17 11:47:59 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/jupiter/junit-jupiter-api/5.11.4/junit-jupiter-api-5.11.4.jar" for 172.16.123.1 at 2024-12-17 11:48:00 +0100
Started GET "/api/v4/projects/22/dependency_proxy/packages/maven/org/junit/platform/junit-platform-commons/1.11.4/junit-platform-commons-1.11.4.jar" for 172.16.123.1 at 2024-12-17 11:48:00 +0100

As you can see, we only have requests for the files themselves. There are no more requests for checksums because they are included in the response along with the file. Only 5 requests (again, 50% reduction) 🎉

🔮 Conclusions

As shown above, this MR can have a large impact on the number of web requests triggered by Maven clients: in both scenarios, the request count was cut in half.

Edited by David Fernandez
