Maven virtual registry: do not cache file digests (!168673) · Merge requests · GitLab.org / GitLab

🌳 Context

In the Maven virtual registry feature, the GitLab instance plays the role of a pull-through cache between a Maven package manager client (such as $ mvn) and an upstream registry:

$ mvn <-> GitLab <-> External Maven registry (upstream)

There are multiple goals here but one of them is caching. The GitLab instance will look at the files going through and cache them (on object storage). When the exact same request is made again, we can serve the file from the GitLab instance instead of the upstream registry.

Now, we need to dig a bit into what Maven clients do when they pull a package. For starters, pulling a Maven package does not mean downloading a single (archive) file. Instead, Maven clients will download a set of files (.pom, .jar, different maven-metadata.xml files). Then, for each file, clients can also request the related digests (usually md5 or sha1) by appending the matching extension to the filename. For example, for a .pom file, clients will also download .pom.md5 and .pom.sha1. Which digests are actually pulled depends on the Maven client implementation, but at least one digest file is expected to be downloaded.
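To make the naming convention concrete, here is a minimal Ruby sketch that derives the digest file names for a given file and computes the values a registry would typically serve for them. The helper names (`digest_filenames`, `digest_contents`) and the sample content are illustrative assumptions, not part of GitLab or Maven.

```ruby
require 'digest'

# Digest algorithms commonly requested by Maven clients, keyed by file extension.
DIGEST_ALGORITHMS = {
  'md5' => Digest::MD5,
  'sha1' => Digest::SHA1
}.freeze

# Hypothetical helper: a digest file name is the original name plus an extension.
def digest_filenames(filename)
  DIGEST_ALGORITHMS.keys.map { |extension| "#{filename}.#{extension}" }
end

# Hypothetical helper: hex digests of the file content, keyed by extension.
def digest_contents(content)
  DIGEST_ALGORITHMS.transform_values { |algorithm| algorithm.hexdigest(content) }
end

puts digest_filenames('spring-web-6.1.12.pom').inspect
puts digest_contents('<project/>').inspect
```

The point of the sketch is that both digests are pure functions of the file content, so they never need to be fetched separately once the file itself is known.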

In the original design of the Maven virtual registry, we didn't care too much about which file is downloaded. The GitLab instance would simply look for the related file on the upstream registry, cache it and that's it. If we take our example again, that means that downloading .pom, .pom.sha1 and .pom.md5 will result in:

3x download a file from the upstream.
3x upload the file to object storage.
3x create a cache entry (called cached response) record.

Now multiply the above by the number of files for a single package, and this can snowball into a very large number of downloads, uploads and record creations for Maven applications that have a non-trivial number of dependencies.

This MR applies a performance improvement that has already been implemented in the GitLab Maven Package Registry. The main idea is this: we use workhorse-assisted uploads to object storage. During the upload, workhorse automatically computes and sends the md5 and sha1 digests. In other words, when rails handles the upload for the .pom file, the structure sent by workhorse already contains the md5 and sha1. We can leverage that by storing the digests on the cached response for .pom. This way, when we receive a request for .pom.md5 or .pom.sha1, we simply need to locate the .pom cached response and return the requested digest. No more upstream interactions, uploads or database record creation. 🚀
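The lookup described above can be sketched in plain Ruby. This is an in-memory illustration only: the `CachedResponse` struct, the `digest_for` helper and the hash used as a cache are assumptions standing in for the real database-backed records.

```ruby
# In-memory stand-in for a cached response record (hypothetical; the real
# record is a database row).
CachedResponse = Struct.new(:relative_path, :file_md5, :file_sha1, keyword_init: true)

DIGEST_EXTENSION = /\.(md5|sha1)\z/

# Returns the digest for a digest request by reading it from the cached
# response of the related file. Returns nil when the path is not a digest
# request or when the related file has not been cached yet.
def digest_for(cache, requested_path)
  match = DIGEST_EXTENSION.match(requested_path)
  return nil unless match

  file_path = requested_path.delete_suffix(match[0])
  record = cache[file_path]
  return nil unless record

  record.public_send("file_#{match[1]}")
end

cache = {
  'spring-web-6.1.12.pom' => CachedResponse.new(
    relative_path: 'spring-web-6.1.12.pom',
    file_md5: 'md5-of-the-pom',
    file_sha1: 'sha1-of-the-pom'
  )
}

puts digest_for(cache, 'spring-web-6.1.12.pom.sha1')
puts digest_for(cache, 'spring-web-6.1.12-test.pom.md5').inspect
```

A digest request never creates a record of its own in this sketch: it either resolves through the record of the related file or yields nothing.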

We need to take one additional precaution here: FIPS. When GitLab runs in FIPS mode, md5 digests should be disabled. Thus, we will not store them and we will reject requests asking for them.
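A minimal sketch of the two FIPS rules, assuming the FIPS state is passed in as a plain boolean. The `UploadedFile` struct and both helper names are hypothetical; they only illustrate the decision, not the actual GitLab code.

```ruby
# Hypothetical shape of the digests sent along with an assisted upload.
UploadedFile = Struct.new(:md5, :sha1, keyword_init: true)

# Builds the digest attributes to store on a cached response.
# `fips_enabled` is an explicit argument here; it stands in for whatever
# check the application uses.
def digest_attributes(uploaded_file, fips_enabled:)
  {
    file_sha1: uploaded_file.sha1,
    # Under FIPS, the md5 digest is not stored at all.
    file_md5: fips_enabled ? nil : uploaded_file.md5
  }
end

# Decides whether a digest request may be served: only md5 under FIPS is refused.
def digest_request_allowed?(extension, fips_enabled:)
  !(fips_enabled && extension == 'md5')
end

upload = UploadedFile.new(md5: 'md5-hex', sha1: 'sha1-hex')
puts digest_attributes(upload, fips_enabled: true).inspect
puts digest_request_allowed?('md5', fips_enabled: true)
```

Keeping sha1 mandatory and md5 optional matches the idea that a cached file can always answer at least one kind of digest request, FIPS or not.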

The above improvement has been described in issue Maven virtual registry: do not cache digests (#497290 - closed)

📻 What does this MR do and why?

In the maven cached responses table:
- add a file_md5 column. This one is optional.
- add a file_sha1 column. This one is required.
When creating a maven cached response record, read the digests coming from the workhorse parameter and set them properly.
When the virtual registry receives a digest request, retrieve the related file and return the correct digest.
- Maven clients will download the related file first and then the digests. Thus, if we don't find the related file, we will simply return a 404 Not Found response. We will not ask the upstream to fulfill the request.
Update the related specs.

The Maven virtual registry feature is currently being worked on. As such, it is gated behind a WIP feature flag. Thus, we don't have any changelog here.

🚥 MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

✅

🖥 Screenshots or screen recordings

No UI changes. 🤷

Moreover, from a Maven client point of view, there are literally no changes. When a file is requested, it is returned. When a digest is requested, it is returned. The changes of this MR are completely transparent. The only difference should be that Maven virtual registries are faster at processing digest requests.

⚙ How to set up and validate locally

To test the changes of this MR, one possible way is to use $ curl to simulate the requests done by Maven clients.

First, a bit of setup. You will need a root group. In a rails console:

Feature.enable(:virtual_registry_maven)
g = Group.find(<root_group_id>) # `g` is the variable used by the calls below

r = ::VirtualRegistries::Packages::Maven::Registry.create!(group: g) # note down the registry id
u = ::VirtualRegistries::Packages::Maven::Upstream.create!(group: g, url: 'https://repo1.maven.org/maven2')
VirtualRegistries::Packages::Maven::RegistryUpstream.create!(group: g, registry: r, upstream: u)

The above sets up an upstream that points to the official public Maven registry: Maven central.

Keep the rails console around as you will need it to reset the backend status. You will also need a Personal access token that has access to the root group.

1️⃣ Pulling a file and its digests

Let's pull an existing file (this will create a cached response record) and then, the digests.

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry_id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom"

This will pull the .pom file from upstream and create a cached response.

In the rails console:

::VirtualRegistries::Packages::Maven::CachedResponse.count # 1

::VirtualRegistries::Packages::Maven::CachedResponse.last # points to the .pom file

Now, let's pull the digests:

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry_id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom.md5"
$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry_id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom.sha1"

Both requests will succeed. Let's check the rails console:

::VirtualRegistries::Packages::Maven::CachedResponse.count # 1

=> no cached response was created for those digests 🚀 ✅

Now, let's ask for the digest of a file that is not in the cached responses:

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry_id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12-test.pom.md5"
{"message":"404 File of the requested digest not found in cached responses Not Found"}

We have a 404 Not Found response with a body explaining what happened. ✅
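The same checks can be scripted. The sketch below builds the requests with Ruby's Net::HTTP instead of $ curl. `build_request` and `fetch_status` are hypothetical helper names, and the host is the local GDK address used in the commands above.

```ruby
require 'net/http'
require 'uri'

BASE_URL = 'http://gdk.test:8000/api/v4/virtual_registries/packages/maven'

# Builds the URI and the GET request for a file (or digest) path in a registry.
def build_request(registry_id, path, token)
  uri = URI("#{BASE_URL}/#{registry_id}/#{path}")
  request = Net::HTTP::Get.new(uri)
  request['Private-Token'] = token
  [uri, request]
end

# Sends the request and returns the HTTP status code as a string.
# This needs a running local GitLab instance.
def fetch_status(registry_id, path, token)
  uri, request = build_request(registry_id, path, token)
  response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
  response.code
end
```

With a running instance, calling `fetch_status` for the .pom, then for .pom.md5 and .pom.sha1, and finally for the -test.pom.md5 path is expected to mirror the $ curl results above, ending with "404" for the last one.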

2️⃣ FIPS mode

Now, we need to check the updated behavior in FIPS mode.

To simulate the FIPS mode, let's update the function that returns the boolean:

diff --git a/lib/gitlab/fips.rb b/lib/gitlab/fips.rb
index c71bd0e1ac9c..64142f7fd355 100644
--- a/lib/gitlab/fips.rb
+++ b/lib/gitlab/fips.rb
@@ -23,7 +23,7 @@ class << self
       #
       # @return [Boolean]
       def enabled?
-        ::Labkit::FIPS.enabled?
+        true
       end
     end
   end

(restart the rails backend if necessary)

Now, in the rails console, let's reset the cache status:

::VirtualRegistries::Packages::Maven::CachedResponse.destroy_all

Ok, let's pull the .pom file again:

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry_id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom"

Now, let's see the rails console:

::VirtualRegistries::Packages::Maven::CachedResponse.count # 1

::VirtualRegistries::Packages::Maven::CachedResponse.last # the `file_md5` field is nil

-> The backend didn't store the md5 digest ✅

Lastly, let's request the md5 digest:

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry_id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom.md5"
{"message":"400 Bad request - MD5 digest is not supported when FIPS is enabled"}

The backend returns a 400 Bad request to clearly indicate that this request should not be made ✅ . This signals that, under FIPS conditions, something is asking for an md5 digest and that this is not expected.

This mirrors the response that the GitLab Maven package registry returns in the same situation.

💾 Database review

🔼 Migration up

main: == [advisory_lock_connection] object_id: 130140, pg_backend_pid: 20352
main: == 20241009123810 AddFileDigestsToVirtualRegistriesMavenCachedResponses: migrating 
main: -- quote_table_name("virtual_registries_packages_maven_cached_responses")
main:    -> 0.0000s
main: -- execute("TRUNCATE TABLE \"virtual_registries_packages_maven_cached_responses\"")
main:    -> 0.0057s
main: -- transaction_open?(nil)
main:    -> 0.0000s
main: -- add_column(:virtual_registries_packages_maven_cached_responses, :file_md5, :binary, {:if_not_exists=>true})
main:    -> 0.0205s
main: -- add_column(:virtual_registries_packages_maven_cached_responses, :file_sha1, :binary, {:null=>false, :if_not_exists=>true})
main:    -> 0.0017s
main: == 20241009123810 AddFileDigestsToVirtualRegistriesMavenCachedResponses: migrated (0.0438s) 

main: == [advisory_lock_connection] object_id: 130140, pg_backend_pid: 20352
ci: == [advisory_lock_connection] object_id: 130480, pg_backend_pid: 20359
ci: == 20241009123810 AddFileDigestsToVirtualRegistriesMavenCachedResponses: migrating 
ci: -- transaction_open?(nil)
ci:    -> 0.0000s
ci: -- add_column(:virtual_registries_packages_maven_cached_responses, :file_md5, :binary, {:if_not_exists=>true})
ci:    -> 0.0027s
ci: -- add_column(:virtual_registries_packages_maven_cached_responses, :file_sha1, :binary, {:null=>false, :if_not_exists=>true})
ci:    -> 0.0014s
ci: == 20241009123810 AddFileDigestsToVirtualRegistriesMavenCachedResponses: migrated (0.0138s) 

ci: == [advisory_lock_connection] object_id: 130480, pg_backend_pid: 20359
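Both new columns are of type :binary. One common way to fit a digest in such a column is to store the raw bytes and convert back to hex when answering a client. The sketch below illustrates that round trip. It is an assumption about the storage format, not something this MR description states, and the helper names are hypothetical.

```ruby
require 'digest'

# Converts a hex digest string into its raw bytes (half the length).
def hex_to_binary(hex)
  [hex].pack('H*')
end

# Converts raw bytes back to the hex string a client expects.
def binary_to_hex(bytes)
  bytes.unpack1('H*')
end

sha1_hex = Digest::SHA1.hexdigest('<project/>')
sha1_bytes = hex_to_binary(sha1_hex)

puts sha1_hex.length      # 40 hex characters
puts sha1_bytes.bytesize  # 20 bytes
puts binary_to_hex(sha1_bytes) == sha1_hex
```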

🔽 Migration down

main: == [advisory_lock_connection] object_id: 129700, pg_backend_pid: 19530
main: == 20241009123810 AddFileDigestsToVirtualRegistriesMavenCachedResponses: reverting 
main: -- transaction_open?(nil)
main:    -> 0.0000s
main: -- remove_column(:virtual_registries_packages_maven_cached_responses, :file_md5, {:if_exists=>true})
main:    -> 0.0101s
main: -- remove_column(:virtual_registries_packages_maven_cached_responses, :file_sha1, {:if_exists=>true})
main:    -> 0.0014s
main: == 20241009123810 AddFileDigestsToVirtualRegistriesMavenCachedResponses: reverted (0.0256s) 

main: == [advisory_lock_connection] object_id: 129700, pg_backend_pid: 19530
ci: == [advisory_lock_connection] object_id: 129700, pg_backend_pid: 19923
ci: == 20241009123810 AddFileDigestsToVirtualRegistriesMavenCachedResponses: reverting 
ci: -- transaction_open?(nil)
ci:    -> 0.0000s
ci: -- remove_column(:virtual_registries_packages_maven_cached_responses, :file_md5, {:if_exists=>true})
ci:    -> 0.0226s
ci: -- remove_column(:virtual_registries_packages_maven_cached_responses, :file_sha1, {:if_exists=>true})
ci:    -> 0.0037s
ci: == 20241009123810 AddFileDigestsToVirtualRegistriesMavenCachedResponses: reverted (0.0525s) 

ci: == [advisory_lock_connection] object_id: 129700, pg_backend_pid: 19923

🏃 Performance review

🛠 Setup

Use this dummy maven application.
Configure a virtual registry in our local GitLab instance that targets maven central, which is the official public registry.

Configure the maven application so that the maven central reference is replaced with the virtual registry endpoint. We will use this settings.xml file:

<settings>
  <mirrors>
    <mirror>
      <id>gitlab-maven</id>
      <name>GitLab proxy of central repo</name>
      <url>http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<registry.id></url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
  <servers>
    <server>
      <id>gitlab-maven</id>
      <configuration>
        <httpHeaders>
          <property>
            <name>Private-Token</name>
            <value><PAT></value>
          </property>
        </httpHeaders>
      </configuration>
    </server>
  </servers>
</settings>

Run $ mvn compile -s settings.xml. This will pull packages (through the virtual registry) and compile the application.
- In this configuration, we will pull close to 1000 files.
To keep the analysis focused on the web requests done between workhorse and rails, the object storage is disabled in our local GitLab instance (the file system will be used to store the uploaded files).
This is by no means a highly accurate performance analysis. The goal here is only to get a glimpse of the improvements.

📊 Numbers

Not all clients use the digests the same way. From our observations, $ mvn will not necessarily request digests during the first pull. However, it will during the second pull and third pull.

So, we're going to pull packages 3 times:

1. The cache being empty, we will pull all files from upstream.
2. The cache will contain the files. With this MR, digests will be ready to be delivered too.
3. By this time, all requests should be covered by the GitLab instance too.

| Version | Cached Responses count (1st pull) | Cached Responses count (2nd pull) | Cached Responses count (3rd pull) | Upstream downloads (1st pull) | Upstream downloads (2nd pull) | Upstream downloads (3rd pull) | GitLab downloads (1st pull) | GitLab downloads (2nd pull) | GitLab downloads (3rd pull) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| With `master` | 916 | 1832 | 1832 | 916 | 916 | 0 | 0 | 916 | 1832 |
| With this MR | 916 | 916 | 916 | 916 | 0 | 0 | 0 | 1832 | 1832 |

We can see that with this MR, consecutive requests are covered sooner: from the 2nd pull onwards, we already have 0 upstream downloads. Moreover, we have fewer cached response records (916 instead of 1832). We didn't measure it here, but this also leads to fewer interactions with object storage.

Note that for the 1st pull (cold cache), we don't have any impact.
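The numbers in the table can be reproduced with a small simulation. This is a simplified model, not a measurement. It assumes that the 1st pull requests files only and that later pulls request each file plus exactly one digest per file, which is inferred from 1832 being twice 916. All names in the sketch are hypothetical.

```ruby
FILE_COUNT = 916 # number of files pulled in the measurements above

# Simulates three consecutive pulls and returns one counter hash per pull.
# Assumption: the 1st pull requests files only; later pulls request each
# file plus one digest per file.
def simulate(serve_digests_from_file_record:)
  cached_paths = {}

  (1..3).map do |pull|
    stats = { upstream: 0, gitlab: 0 }
    requests = (1..FILE_COUNT).flat_map do |i|
      file = "file-#{i}.pom"
      pull == 1 ? [file] : [file, "#{file}.sha1"]
    end

    requests.each do |path|
      file_path = path.delete_suffix('.sha1')
      digest_request = path != file_path

      if cached_paths.key?(path)
        stats[:gitlab] += 1
      elsif digest_request && serve_digests_from_file_record
        # Served from the record of the related file; a missing file would
        # be a 404 and is counted nowhere.
        stats[:gitlab] += 1 if cached_paths.key?(file_path)
      else
        # Cache miss: download from upstream and create a cached response.
        stats[:upstream] += 1
        cached_paths[path] = true
      end
    end

    stats.merge(cached_responses: cached_paths.size)
  end
end

puts simulate(serve_digests_from_file_record: false).inspect # models `master`
puts simulate(serve_digests_from_file_record: true).inspect  # models this MR
```

Under these assumptions, the only difference between the two runs is the branch that answers digest requests from the record of the related file, which is enough to account for the gap in cached responses and upstream downloads.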

Edited Oct 16, 2024 by David Fernandez
