
Handle warm cache situations in Maven virtual registry

David Fernandez requested to merge 467983-cache-logic-warm-cache into master

🖼 Context

This is the follow-up MR to !163641 (merged). See that MR for more about the context.

In this MR, we handle the situation where the Maven virtual registry cache already contains the requested file (the warm cache situation).

As described in Maven Virtual Registry: Cache logic (#467983), we are aiming for smarter logic than simply "we have the file, so return it from the GitLab instance". This is mainly due to how remote registries handle files, and packages in general.

For example, on public registries, packages are generally immutable. A package name+version can be created once but it will never be updated or overwritten. As such, if we cache those files, it's generally safe to consider them valid forever.

For third-party package registries, this immutability is not guaranteed. For example, in the GitLab Maven Package registry, a package can be destroyed and a package with the same name+version but different files can be uploaded. Thus, files can change over time.

To cope with this, we let users tell us how long a cached file in the virtual registry is considered valid or, better said, not stale. This impacts performance too: serving a cached file from the GitLab instance is much faster than the path where we need to check with the upstream whether we still have the same file, mainly because with a valid cache we don't send any request to the upstream.

Now, checking with the upstream can be a slow operation if we need to re-download the file (as if we were in the cold cache situation). Thus, we try to be smarter by:

  • storing the ETag coming from the upstream.
  • sending a HEAD request and comparing the ETag values:
    • if they are the same, then the cached file is still valid. The validity period of this cached file starts over.
    • if they are different (or the upstream value is missing, since the ETag header is optional), we have no other choice than to re-download the file (as if we were in a cold cache situation).
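
The decision flow above can be sketched in plain Ruby. This is only an illustration, not the service code from this MR: CachedEntry, warm_cache_action, validity_seconds and fetch_upstream_etag are hypothetical names, and the HEAD request is replaced by a callable so the sketch runs without a network.

```ruby
# Hypothetical stand-in for a cached response row. The attribute names mirror
# the ones used in this description (upstream_checked_at, upstream_etag).
CachedEntry = Struct.new(:upstream_checked_at, :upstream_etag, keyword_init: true)

# Returns one of:
#   :serve_from_cache       - still within the validity period, no upstream request
#   :serve_after_revalidate - stale, but the upstream ETag matches the stored one
#   :download_from_upstream - stale and the ETag differs or is missing
#
# fetch_upstream_etag is a callable standing in for the HEAD request (assumption).
def warm_cache_action(entry, validity_seconds:, fetch_upstream_etag:, now: Time.now)
  fresh = entry.upstream_checked_at + validity_seconds > now
  return :serve_from_cache if fresh

  upstream_etag = fetch_upstream_etag.call
  if upstream_etag && upstream_etag == entry.upstream_etag
    entry.upstream_checked_at = now # the validity period starts over
    :serve_after_revalidate
  else
    :download_from_upstream
  end
end
```

Note that the callable is only invoked on the stale path, which is where the performance gain of a valid cache comes from.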

This MR implements the above logic.

🤔 What does this MR do and why?

  • Update the handle file request service to support the warm cache situation.
  • Update the API endpoints to be able to return the cached file directly.
  • Update the create cached file service to be a create or update service.
    • We don't want to keep the cached file content around if it doesn't match the upstream file. Thus, we will overwrite/update the file contents stored on object storage.
  • Update the related specs.
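
As a rough illustration of the create or update behavior, here is a minimal in-memory sketch. It assumes a Hash in place of the database table and object storage; CachedFile and create_or_update_cached_file are hypothetical names and not the actual service from this MR.

```ruby
# Hypothetical in-memory stand-in for a cached response and its stored file.
CachedFile = Struct.new(:content, :etag, :content_type, :upstream_checked_at, keyword_init: true)

# Creates the entry for relative_path when it is missing. Otherwise the existing
# entry is overwritten in place, so content that no longer matches the upstream
# is not kept around.
def create_or_update_cached_file(store, relative_path, content:, etag:, content_type:, now: Time.now)
  entry = store[relative_path] ||= CachedFile.new
  entry.content = content
  entry.etag = etag
  entry.content_type = content_type
  entry.upstream_checked_at = now
  entry
end
```

Calling it twice with the same relative path leaves a single entry holding the latest content and ETag.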

The Maven virtual registry is still being implemented and is thus behind a feature flag.

🦄 Screenshots or screen recordings

No UI changes

How to set up and validate locally

First, we need to set up the Maven virtual registry and request a file.

Follow !163641 (merged). Make sure that a CachedResponse row exists (created with the $ curl command). For the rest of the explanations, we're going to assume that we will request the same file (/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom).

Make sure that a cached response is available:

cr = ::VirtualRegistries::Packages::Maven::CachedResponse.last

Downloading the file again will not hit the upstream:

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom"

Now, let's update the cached response to make it invalid:

cr.update!(upstream_checked_at: 3.days.ago)

Download the file:

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom"

Notice that the upstream_checked_at has been updated to Time.zone.now:

cr.reload.upstream_checked_at

Now, let's update the cached response to make it invalid and make the stored upstream ETag different:

cr.update!(upstream_checked_at: 3.days.ago, upstream_etag: 'test')

Download the file:

$ curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom"

Notice that the upstream_checked_at has been updated to Time.zone.now and the upstream ETag has been updated too:

cr.reload.upstream_checked_at
cr.upstream_etag

Also notice that cr.downloads_count and cr.downloaded_at are bumped to the correct values.

If you want to go further in the checks, use $ curl -vvv to log the response headers and notice that the Content-Type header is correctly set to the value sent by the upstream, in this case text/xml.

💽 Database review

Edited by David Fernandez
