Maven Virtual Registries: implement remote included checksums

🔥 Problem

When using Maven virtual registries, we see $ mvn request a file (/pkg.pom) and then its digest file (/pkg.pom.sha1). Some other Maven clients (older ones? 🤔) even ask for the MD5 digest (/pkg.pom.md5).

As a result, Maven virtual registries can receive 2 or 3 requests for a single file.

This takes time, and the $ mvn command being executed is noticeably slowed down by it.

⛏️ Digging deeper

Looking at the requests made me wonder: why can't we send the digests along with the response to the first request (/pkg.pom)? It would save network requests.

Looking around in the Maven documentation, I stumbled upon: https://maven.apache.org/resolver/expected-checksums.html. As explained there, historically the checksums were pulled with separate requests (/pkg.pom.sha1 or /pkg.pom.md5), but the logic responsible for file transport (called Resolver) has been updated to support different strategies:

1. Provided. Somehow, we can provide the checksums to the $ mvn client in advance.
2. Remote Included. Checksums are included in the response of the file request (/pkg.pom).
3. Remote External. Checksums are retrieved by firing additional requests (/pkg.pom.sha1).

Remote Included looks like exactly what we need.

The order of the strategies above is also their priority order. This means that the logic will try Provided first and then Remote Included. If both fail, the last resort is Remote External.
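To make the priority order concrete, here is a minimal sketch of that fallback. This is not Resolver code: the function names, the stand-in strategies, and the sample digest are all assumptions made up for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A strategy takes a file path and returns a hex digest, or None if it has nothing.
Strategy = Callable[[str], Optional[str]]

# Sample digest used by the stand-ins below (SHA-1 of an empty file).
SAMPLE_SHA1 = "da39a3ee5e6b4b0d3255bfef95601890afd80709"


def resolve_checksum(
    path: str, strategies: List[Tuple[str, Strategy]]
) -> Optional[Tuple[str, str]]:
    """Try each strategy in priority order and return the first hit."""
    for name, strategy in strategies:
        checksum = strategy(path)
        if checksum is not None:
            return name, checksum
    return None


def provided(path: str) -> Optional[str]:
    # Hypothetical: nothing was handed to the client in advance.
    return None


def remote_included(path: str) -> Optional[str]:
    # Hypothetical: headers that came back with the /pkg.pom response itself.
    response_headers = {"x-checksum-sha1": SAMPLE_SHA1}
    return response_headers.get("x-checksum-sha1")


def remote_external(path: str) -> Optional[str]:
    # Hypothetical: this is where the extra /pkg.pom.sha1 request would happen.
    return SAMPLE_SHA1


PRIORITY_ORDER = [
    ("provided", provided),
    ("remote_included", remote_included),
    ("remote_external", remote_external),
]

print(resolve_checksum("/pkg.pom", PRIORITY_ORDER))
```

With these stand-ins, Remote Included answers first, so the Remote External request is never made.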

Why is it so interesting to implement the Remote Included? Well, citing their documentation:

The big win here is that by obtaining hashes using “Remote Included” and not by “Remote External” strategy, we can halve the count of HTTP requests to download an Artifact.

This is quite big. We can halve the number of requests that $ mvn sends. You can imagine the impact on the $ mvn command execution time and on the load on the GitLab instance.
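As a back-of-the-envelope illustration of that saving, here is a tiny sketch. The figure of 100 files is an arbitrary assumption, not a measurement, and the function name is made up.

```python
def request_count(files: int, checksum_requests_per_file: int) -> int:
    """Total HTTP requests: one per file plus any separate checksum requests."""
    return files * (1 + checksum_requests_per_file)


FILES = 100  # assumed number of files fetched during one $ mvn run

remote_external = request_count(FILES, 1)      # file + .sha1
remote_external_md5 = request_count(FILES, 2)  # file + .sha1 + .md5
remote_included = request_count(FILES, 0)      # checksums ride along in headers

print(remote_external, remote_external_md5, remote_included)  # prints: 200 300 100
```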

🚒 Solution

Looking at https://maven.apache.org/resolver/expected-checksums.html#remote-included-checksums, the expected result seems quite simple: the response to /pkg.pom should contain the extra HTTP headers x-checksum-sha1 and x-checksum-md5, each carrying the corresponding digest.

Technically, when GitLab handles the /pkg.pom request, we already have the digest values and we can return them.
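A minimal sketch of what those headers would look like. This is not the GitLab implementation: the function name is hypothetical, and the digests are computed on the fly here only to keep the example self-contained, whereas GitLab would return the values it already has.

```python
import hashlib
from typing import Dict


def checksum_headers(content: bytes) -> Dict[str, str]:
    """Build the extra response headers carrying the file digests.

    Assumption: digests are computed from the content for illustration only;
    in practice the already-known digest values would be used.
    """
    return {
        "x-checksum-sha1": hashlib.sha1(content).hexdigest(),
        "x-checksum-md5": hashlib.md5(content).hexdigest(),
    }


# Example with a made-up POM body.
for name, value in checksum_headers(b"<project></project>").items():
    print(f"{name}: {value}")
```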

Simple, right?

Well, not so much actually. The main challenge is that we have 3 ways of returning a file from the GitLab instance:

1. From the file system.
2. From object storage.
   - proxy_download: false: returning a redirect to a signed URL that the client will follow.
   - proxy_download: true: instructing workhorse to download the file from a signed URL and send it back to the client.

(1.) is quite straightforward to handle, so it will not be discussed.

(2.) is where the challenge lies. The virtual registry feature uses the dependency proxy object storage configuration. On gitlab.com, proxy_download is set to false, which means that the GitLab instance returns a redirect to a signed URL. The main problem with this is that it is impossible to add custom response headers once the client follows that redirect. We can't instruct object storage: hey, when this file is requested, send back these response headers. This is probably for security reasons.

So, proxy_download: false is a blocker for our solution.

The only way forward is to use proxy_download: true (we can force it on a per-file basis). This way, we can instruct workhorse: return the file from this signed URL to the client, oh and by the way, set these custom headers.
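The decision described above can be sketched roughly as follows. None of these names exist in GitLab or workhorse; the Delivery type, the mode strings, and plan_delivery are all hypothetical and only model the reasoning.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Delivery:
    mode: str  # "send_file", "redirect", or "workhorse_proxy" (made-up labels)
    headers: Dict[str, str] = field(default_factory=dict)


def plan_delivery(
    in_object_storage: bool,
    proxy_download: bool,
    checksum_headers: Dict[str, str],
) -> Delivery:
    """Pick how to return a file so that the checksum headers reach the client."""
    if not in_object_storage:
        # File system: the instance serves the file itself, headers are easy to add.
        return Delivery("send_file", dict(checksum_headers))
    if checksum_headers:
        # A redirect to a signed URL can't carry custom headers, so force
        # proxying for this file regardless of the configured setting.
        return Delivery("workhorse_proxy", dict(checksum_headers))
    if proxy_download:
        return Delivery("workhorse_proxy")
    return Delivery("redirect")


headers = {"x-checksum-sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709"}
print(plan_delivery(in_object_storage=True, proxy_download=False, checksum_headers=headers).mode)
```

Even with proxy_download configured as false, the sketch routes the file through workhorse whenever there are checksum headers to send.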

The main downside of proxy_download: true is that we increase the load on workhorse. I think this is still acceptable given the possible benefits (number of requests / 2). Also, what makes me confident in this change is that we already have this case in the package registry area. Most of the registries use proxy_download: false, but NuGet, for technical reasons, must use proxy_download: true.

Overall, I think it's worth a try.

🔮 Other considerations

The above headers could also be applied to the Maven package registry so that it benefits from the same reduction in network requests.

Edited Nov 22, 2024 by David Fernandez