Support remote checksums headers in the Maven virtual registry (!173687) · Merge requests · GitLab.org / GitLab

🍬 Context

In &14137 (closed), we're working towards the first step of Virtual Registries. In short, with Virtual Registries, the GitLab instance acts as a pull-through proxy between a package manager client and an "upstream" registry.

Our first iteration covers the Maven package format.

Maven client ($ mvn for example) <-> GitLab instance <-> Upstream Registry (like Maven Central)

The main goal is that, at some point, users can handle multiple upstreams behind a single virtual registry. In addition, files pulled through the GitLab instance are cached in object storage. This way, pulling the same file multiple times does not require pulling it from upstream again.

Now, when a Maven client requests a file, it can trigger multiple web requests. For example:

1. /pkg.pom
2. /pkg.pom.sha1
3. /pkg.pom.md5

The first one is the actual file and the other two are its digests. Not all clients request both digests; most of the time, it's only the sha1. We assume that, historically, md5 was the first checksum to be used and that Maven clients later transitioned to sha1. However, we still need to support both.

🎯 Avoiding the checksum requests

We've been actively working on improving the execution time of the maven virtual registry. While looking to further improve requests on digests, we stumbled upon https://maven.apache.org/resolver/expected-checksums.html#remote-included-checksums.

In short, the response returning the file (request 1.) can set specific response headers that also carry both checksums. When it does, Maven clients skip requests (2.) and (3.) entirely. Thus, in a single request, Maven registries can send back the file and its digests.

If the response headers are missing, Maven clients fall back to the old way: requesting the digests directly (2. and 3.).
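The fallback behavior described above can be sketched from the client's point of view. This is a simplified model, not the Maven resolver's actual code: `CHECKSUM_HEADERS` and `remaining_digest_requests` are hypothetical names, and the header names are the ones visible in the curl output of the validation section.

```ruby
# Hypothetical mapping from digest file extension to the response header
# carrying the same checksum.
CHECKSUM_HEADERS = {
  'sha1' => 'x-checksum-sha1',
  'md5'  => 'x-checksum-md5'
}.freeze

# Returns the digest requests a client would still have to send after
# receiving the file response with the given headers.
def remaining_digest_requests(path, response_headers, wanted: %w[sha1 md5])
  # HTTP header names are case-insensitive, so normalize before the lookup.
  present = response_headers.keys.map { |name| name.to_s.downcase }

  wanted
    .reject { |ext| present.include?(CHECKSUM_HEADERS.fetch(ext)) }
    .map { |ext| "#{path}.#{ext}" }
end

# Without checksum headers, both digest requests are still needed.
puts remaining_digest_requests('/pkg.pom', {}).inspect

# With both headers, no follow-up request is needed.
headers = { 'X-Checksum-Sha1' => 'sha1-hex', 'X-Checksum-Md5' => 'md5-hex' }
puts remaining_digest_requests('/pkg.pom', headers).inspect
```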

Implementing the custom headers

So, it's "just" a matter of setting the correct custom response headers when we send back the file from the maven virtual registry.
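As a rough illustration of what "setting the correct custom response headers" means, here is a minimal sketch that builds the two headers from a file's content. `checksum_headers` is a hypothetical helper, and computing the digests on the fly is an assumption made for brevity; a registry could just as well read digests it stored when the file was cached.

```ruby
require 'digest'

# Hypothetical helper: builds the checksum response headers for a file body.
# Values are lowercase hex digests, as in the curl output of this MR.
def checksum_headers(content)
  {
    'X-Checksum-Sha1' => Digest::SHA1.hexdigest(content),
    'X-Checksum-Md5'  => Digest::MD5.hexdigest(content)
  }
end

checksum_headers('<project/>').each do |name, value|
  puts "#{name}: #{value}"
end
```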

To send back the file, we leverage this helper function. The body of this function is very clear: we have 3 cases to handle.

1. Object storage disabled. In this case, the file system is used, and we can set custom headers.
2. Object storage enabled + direct download enabled. In this case, we send a redirect to the signed URL of the file on object storage. Here, we can't send custom response headers. That would basically mean telling object storage: create a signed URL to get this file, oh and by the way, I want these custom response headers to be set when you return the file.
   - Technically, it's possible to control response headers, but only a known subset, such as the content type or content disposition header. The problem is that Maven clients expect a custom x-* header.
   - The other challenge here is that different object storage providers set different headers that might cover what we want here. For example, GCS sets the x-goog-hash headers.
3. Object storage enabled + direct download disabled. In this case, we use workhorse to proxy the download of the file (through a signed URL). Since it's workhorse, its logic (send_url) is flexible enough to let us set any response header we want.
   - We actually leverage this for the content type and disposition headers.

From the above, the improvement can only be applied to (1.) and (3.). Given the possible impact of this improvement, we're going to force the endpoint to not use direct downloads when object storage is configured. This means that no matter how the uploader class is configured (direct downloads allowed or not), the Maven virtual registry download endpoint will always use workhorse to proxy the download and hence apply the improvement.
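The three cases and the forced proxying can be summarized with a small decision function. This is only a model of the reasoning above: `delivery_mode`, `custom_headers_supported?`, and the `force_proxy` flag are hypothetical names, not GitLab APIs.

```ruby
# Hypothetical model of how a cached file is sent back to the client.
def delivery_mode(object_storage:, direct_download:, force_proxy: false)
  # Case 1: no object storage, the file is served from the file system.
  return :local_file unless object_storage
  # Case 3 (or forced): workhorse proxies the download.
  return :workhorse_proxy if force_proxy || !direct_download

  # Case 2: the client is redirected to a signed URL on object storage.
  :redirect_to_signed_url
end

# Custom checksum headers can be set in every mode except the redirect.
def custom_headers_supported?(mode)
  mode != :redirect_to_signed_url
end

mode = delivery_mode(object_storage: true, direct_download: true, force_proxy: true)
puts "#{mode} supports custom headers: #{custom_headers_supported?(mode)}"
```

With `force_proxy: true`, the redirect case can no longer be reached, which is the effect this MR aims for on the download endpoint.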

This is issue Maven Virtual Registries: implement remote incl... (#505819 - closed).

🤔 What does this MR do and why?

Update the API helper #present_carrierwave_file! to allow passing extra response headers.
Update the related service in the maven virtual registry to return the file and its digests.
Update the Maven virtual registry API endpoint to properly read the response from the service and call #present_carrierwave_file! to set the custom headers.
Update the related specs.
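The shape of these changes can be sketched in plain Ruby. Everything below is a hypothetical stand-in: `CachedResponse`, `present_file`, and `download_endpoint` are illustrative names, the digest strings are placeholders, and the real #present_carrierwave_file! helper and service in this MR may look quite different.

```ruby
# Hypothetical service result: the cached file content plus its digests.
CachedResponse = Struct.new(:content, :content_type, :sha1, :md5, keyword_init: true)

# Hypothetical stand-in for a file-presenting helper that accepts extra
# response headers. It only builds a hash; a real helper would send the file.
def present_file(content, content_type:, extra_response_headers: {})
  headers = { 'Content-Type' => content_type }.merge(extra_response_headers)
  { status: 200, headers: headers, body: content }
end

# Hypothetical endpoint glue: read the service result and pass the digests
# to the helper as custom response headers.
def download_endpoint(cached)
  present_file(
    cached.content,
    content_type: cached.content_type,
    extra_response_headers: {
      'X-Checksum-Sha1' => cached.sha1,
      'X-Checksum-Md5'  => cached.md5
    }
  )
end

cached = CachedResponse.new(
  content: '<project/>',
  content_type: 'application/xml',
  sha1: 'sha1-hex-digest', # placeholder value
  md5: 'md5-hex-digest'    # placeholder value
)
puts download_endpoint(cached)[:headers].inspect
```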

The maven virtual registry is behind a WIP feature flag. Thus, we don't need a Changelog here.

📚 References

🏁 MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

✅

🦄 Screenshots or screen recordings

No UI changes.

⚙️ How to set up and validate locally

Enable the feature flag: Feature.enable(:virtual_registry_maven).
Have a PAT and a root group (any visibility) ready.

For the virtual registry settings, we don't have a UI or API (yet), so we need to create them in a rails console:

r = ::VirtualRegistries::Packages::Maven::Registry.create!(group: <root_group>)
u = ::VirtualRegistries::Packages::Maven::Upstream.create!(group: <root_group>, url: 'https://repo1.maven.org/maven2')
VirtualRegistries::Packages::Maven::RegistryUpstream.create!(group: <root_group>, registry: r, upstream: u)

We're going to use $ curl to inspect the headers.

$ curl -vvv --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-context/6.1.15/spring-context-6.1.15.pom"
...
< X-Checksum-Md5: 119e499949edb2d335758abfe596429c
< X-Checksum-Sha1: 2f9638d08b69b15cfabcc41c133d1f796cb827bc
...

⚠️ You might need to execute the above $ curl command twice. The first time, it will pull the file and return it from the upstream while uploading it to the GitLab instance. The second time, it will be served from the GitLab instance only (cache hit situation). That second request is where we expect to apply this optimization.
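To go one step further than reading the headers by eye, the returned checksums can be compared with digests computed from the downloaded body. This is a minimal sketch with a hypothetical `checksums_match?` helper; it assumes the body and the response headers have already been captured (for example from the curl call above) and makes no network request itself.

```ruby
require 'digest'

# Hypothetical check: true only when both checksum headers are present and
# equal to the digests of the downloaded body.
def checksums_match?(body, headers)
  # Header names are case-insensitive, so compare on lowercase names.
  normalized = headers.transform_keys { |name| name.to_s.downcase }
  expected = {
    'x-checksum-sha1' => Digest::SHA1.hexdigest(body),
    'x-checksum-md5'  => Digest::MD5.hexdigest(body)
  }

  expected.all? { |name, digest| normalized[name].to_s.downcase == digest }
end

body = 'abc' # stand-in for the downloaded file content
headers = {
  'X-Checksum-Sha1' => Digest::SHA1.hexdigest(body),
  'X-Checksum-Md5'  => Digest::MD5.hexdigest(body)
}
puts checksums_match?(body, headers)
```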

🏎️ Performance analysis

For this, we're going to use a dummy maven project and connect it with the maven virtual registry to pull all its dependencies.

Follow the setup steps described in !173001 (merged), including warming the cache.

Once the cache is warmed (cached entries exist), we can run the pipeline again using this MR or switching to master.

Here are the results.

| Scenario | Amount of web requests received by the GitLab instance | Execution time reported by the mvn command |
| --- | --- | --- |
| On `master` | `2826` | `49.666 s` |
| With this MR | `942` (`~66%` reduction 😱) | `27.446 s` (`~45%` reduction 😱) |

⚠️ We will have different numbers on gitlab.com, as conditions there differ from the local setup. However, we should still be able to see a noticeable performance improvement.

Another interesting thing to note is that, with this MR, the number of web requests sent by $ mvn is almost exactly the number of cached responses. In other words, for each dependency that $ mvn needs to pull for the dummy project, it will trigger only 1 web request and not more. 🎉
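The reductions reported in the table can be rechecked with a few lines of arithmetic. `reduction_percent` is a hypothetical helper; the input numbers are the ones measured above.

```ruby
# Percentage reduction from `before` to `after`, rounded to one decimal.
def reduction_percent(before, after)
  ((before - after) / before.to_f * 100).round(1)
end

requests_reduction = reduction_percent(2826, 942)
time_reduction     = reduction_percent(49.666, 27.446)
requests_ratio     = 2826 / 942.0

puts "Web requests: #{requests_reduction}% fewer"
puts "Execution time: #{time_reduction}% shorter"
puts "Requests on master per request with this MR: #{requests_ratio}"
```

The request count drops by exactly a factor of 3, which is consistent with one file request plus two digest requests per file on `master`, although the measurements above do not break the requests down by type.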

Edited Nov 28, 2024 by David Fernandez
