Maven virtual registry: do not cache digests

🔥 Problem

In the Maven virtual registry, we started with a design where the virtual registry sees file requests passing through. Each file request is answered with whatever the upstream returns.

In the Maven world, a package pull is not a single file download. Instead, multiple files are downloaded from the target registry. Among them, clients can ask for the digests of a given file. For example, .pom and .pom.sha1 are two files that can be requested.
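
To make the relationship between a file and its digest files concrete, here is a minimal sketch of how a request path could be split into the base file and the requested digest. The function name `split_digest_request` and the list of recognized extensions are assumptions made for illustration; the set of digests the virtual registry actually supports may differ.

```python
from typing import Optional, Tuple

# Assumption: digest files are addressed by appending one of these
# extensions to the path of the file they describe.
DIGEST_EXTENSIONS = ("md5", "sha1", "sha256", "sha512")


def split_digest_request(path: str) -> Tuple[str, Optional[str]]:
    """Return (base_file_path, digest_algorithm).

    digest_algorithm is None when the path is a regular file request.
    """
    base, dot, extension = path.rpartition(".")
    if dot and extension in DIGEST_EXTENSIONS:
        return base, extension
    return path, None


if __name__ == "__main__":
    print(split_digest_request("com/example/app/1.0/app-1.0.pom"))
    print(split_digest_request("com/example/app/1.0/app-1.0.pom.sha1"))
```

With this split, a request for `app-1.0.pom.sha1` can be recognized as a question about `app-1.0.pom` rather than as an independent file.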

Given the virtual registry's initial design, digest requests are treated as file requests. Thus, digests are pulled from the upstream and stored in object storage. These extra interactions add latency to requests.

Now, during the implementation, we started seeing technical challenges with the amount of data to store. To ease this, we decided to split the data by package format. In other words, cached responses for Maven packages live in their own table, completely separate from other package formats. As such, cached responses can have custom columns or data that the Maven format requires.

🚒 Solution

Given that we have dedicated tables, we can apply what we have been doing in the Maven package registry:

  • Have columns in cached responses to store digests.
  • When a file is uploaded, read the digests that workhorse provides along with the upload and store them in the matching columns.
  • When a digest is requested, locate the cached response of the related file and return the value of the matching digest column.
  • Do not request digests from the upstream and do not store them in object storage.
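
The steps above can be sketched roughly as follows. This is an illustrative in-memory model, not the actual implementation: the names `CachedResponse`, `MavenCache`, `file_md5` and `file_sha1` are hypothetical, the dictionaries stand in for the database table and object storage, and the `digests` argument stands in for whatever workhorse provides with the upload. What happens when a digest is requested for a file that is not cached yet is not covered here; the sketch simply returns `None`.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CachedResponse:
    # Hypothetical row of the Maven-specific cached responses table.
    relative_path: str
    object_storage_key: str
    file_md5: str
    file_sha1: str


class MavenCache:
    def __init__(self) -> None:
        self.rows: Dict[str, CachedResponse] = {}    # stand-in for the table
        self.object_storage: Dict[str, bytes] = {}   # stand-in for object storage

    def store_file(self, relative_path: str, content: bytes,
                   digests: Dict[str, str]) -> None:
        """Store the file once and keep its digests in dedicated columns."""
        key = f"maven/{relative_path}"
        self.object_storage[key] = content
        self.rows[relative_path] = CachedResponse(
            relative_path=relative_path,
            object_storage_key=key,
            file_md5=digests["md5"],
            file_sha1=digests["sha1"],
        )

    def digest_for(self, request_path: str) -> Optional[str]:
        """Answer a digest request from the columns only.

        No upstream call is made and nothing is written to object storage.
        """
        base, dot, algorithm = request_path.rpartition(".")
        if not dot or algorithm not in ("md5", "sha1"):
            return None  # not a digest request handled by this sketch
        row = self.rows.get(base)
        if row is None:
            return None  # related file is not cached
        return row.file_md5 if algorithm == "md5" else row.file_sha1


if __name__ == "__main__":
    cache = MavenCache()
    body = b"<project/>"
    # hashlib is used here only to simulate the digests given with the upload.
    cache.store_file("com/example/app-1.0.pom", body, {
        "md5": hashlib.md5(body).hexdigest(),
        "sha1": hashlib.sha1(body).hexdigest(),
    })
    print(cache.digest_for("com/example/app-1.0.pom.sha1"))
    print(len(cache.object_storage))  # only the .pom itself is stored
```

In this model, a digest request costs one row lookup, and only the file itself ever reaches object storage.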

This should lower the number of object storage and upstream interactions.