Maven Package Registry: implement remote included checksum

🔥 Problem and Context

While working on the Maven virtual registry, we noticed that when $ mvn requests a file, it can trigger the following requests:

  1. /pkg.pom. The file itself.
  2. /pkg.pom.sha1. The sha1 checksum.
  3. /pkg.pom.md5. The md5 checksum.

Depending on the situation and the conditions, maven clients can download only one of the checksums (it's generally the sha1). They can also download the checksum multiple times ( 🤷).

Please note that the above are 3 distinct web requests. Also note that this happens for every file that $ mvn needs to pull. Given how maven packages are organized (each package is a set of files), the number of web requests can quickly snowball.

https://maven.apache.org/resolver/expected-checksums.html#remote-included-checksums describes how checksums can be returned in the response to request (1.). When they are, maven clients completely skip requests (2.) and (3.). This noticeably reduces the number of web requests received by the registry, which saves backend resources (such as cpu time or database queries).

Checksums are sent back as custom http headers (prefixed with x-).
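To illustrate the idea, here is a minimal sketch of building such headers for a file. The header names x-checksum-sha1 and x-checksum-md5 are assumptions based on the x- convention in the Maven Resolver documentation linked above, and checksum_headers is a hypothetical helper, not the registry's actual code. A real registry would likely reuse checksums stored at upload time instead of hashing the file on every download.

```python
import hashlib

# Assumed header names (not confirmed by this issue); check the Maven
# Resolver documentation before relying on them.
SHA1_HEADER = "x-checksum-sha1"
MD5_HEADER = "x-checksum-md5"


def checksum_headers(file_bytes: bytes) -> dict[str, str]:
    """Hypothetical helper: headers to attach to the file download response."""
    return {
        SHA1_HEADER: hashlib.sha1(file_bytes).hexdigest(),
        MD5_HEADER: hashlib.md5(file_bytes).hexdigest(),
    }


if __name__ == "__main__":
    # Print the headers that would accompany a tiny example payload.
    for name, value in checksum_headers(b"hello").items():
        print(f"{name}: {value}")
```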

By saving requests (2.) and (3.), maven clients will also execute faster because they will spend less time sending web requests.
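The saving can be sketched with simple arithmetic. This assumes the worst case described above, where the client fetches both checksums exactly once per file; as noted, real clients may fetch only one checksum or fetch a checksum several times. The function name and parameters are illustrative only.

```python
def request_count(files: int, checksums_per_file: int = 2,
                  included_checksums: bool = False) -> int:
    """Estimate web requests needed to pull `files` files.

    Assumption: without included checksums, every file costs one request
    for the file plus one request per checksum.
    """
    if included_checksums:
        # Checksums travel in the headers of the file response itself.
        return files
    return files * (1 + checksums_per_file)


if __name__ == "__main__":
    files = 100
    before = request_count(files)
    after = request_count(files, included_checksums=True)
    print(f"{files} files: {before} requests without, {after} with included checksums")
```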

⚔️ Design

The main challenge here is that request (1.) is a download file endpoint. As such, we need to locate the uploaded file and return it. We currently have 3 ways of returning the file:

  1. When object storage is disabled, the file is read and returned from the file system.
  2. When object storage is enabled, the behavior depends on the proxy download configuration:
     a. When enabled, workhorse streams the file back to the maven client.
     b. When disabled, the backend returns a redirect response to a signed url (temporary url) that points to the file on object storage.

Technically speaking, we can only return custom headers in (1.) and (2.) (a.). In (2.) (b.), due to the use of a signed url, it's impossible to instruct the object storage provider to return custom http headers.

Thus, for (2.), we will need to ignore the proxy download configuration and force a proxy_download: true configuration.

The proxy_download: true will make use of the workhorse send_logic. This is the price to pay for this improvement: we're going to increase the load on workhorse for the send_logic part.
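The decision described in this section can be sketched as follows. All names here (Delivery, choose_delivery, send_checksum_headers) are hypothetical and only model the three cases listed above; they are not the actual backend or workhorse code.

```python
from enum import Enum


class Delivery(Enum):
    FILE_SYSTEM = "file_system"                  # case (1.)
    WORKHORSE_PROXY = "workhorse_proxy"          # case (2.) (a.)
    SIGNED_URL_REDIRECT = "signed_url_redirect"  # case (2.) (b.)


def choose_delivery(object_storage_enabled: bool,
                    proxy_download: bool,
                    send_checksum_headers: bool) -> Delivery:
    """Pick how the file is returned to the maven client."""
    if not object_storage_enabled:
        return Delivery.FILE_SYSTEM
    # A signed url cannot carry custom headers, so sending checksum
    # headers forces the proxy path regardless of the configuration.
    if proxy_download or send_checksum_headers:
        return Delivery.WORKHORSE_PROXY
    return Delivery.SIGNED_URL_REDIRECT


if __name__ == "__main__":
    print(choose_delivery(True, False, True))
```

With send_checksum_headers set, the only remaining outcomes are the file system and the workhorse proxy, which are the two paths where custom headers can be returned.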

🚒 Solution

  • Send back checksums with custom http headers when sending back a file in the Maven package registry.
  • Update the related specs.
  • Update the documentation to explain the force proxy_download: true situation.
  • Use a feature flag for this change.
Edited by David Fernandez