
Introduce a local cache entry model for maven virtual registries

🧦 Context

In the Maven virtual registry world, we pull files from an upstream through the GitLab virtual registry. While doing so, we also cache the requested file so that subsequent requests are served solely by GitLab and not by the upstream.

The data model is as follows:

Registry <-n:1- RegistryUpstream -1:n-> Upstream <-n:1- CacheEntry

Basically, a Registry can have multiple Upstreams (through a join model RegistryUpstream) and an Upstream has many cache entries.

Up to now, an Upstream was a URL with optional credentials that defined how to access a remote upstream.

With Maven virtual registries: local upstreams (#548558) • Bonnie Tsang • 18.6, we want to introduce the concept of a local upstream. Instead of looking for files in remote upstreams, we look at the GitLab Maven package registry. In short, we point to a local project or group and inspect the (Maven) packages available there. To handle this, we need to update the existing logic to inspect local upstreams. But first, we need to define what a local upstream requires.

⚔️ Design choices

In !206725 (merged), the Upstream model supports a local mode through specific URLs (global IDs pointing to projects or groups).

Now, at the cache entry level, the problem is that we need to record that a local Upstream has a requested file and that it is available as a given PackageFile (a model from the package registry). It is important to persist this information so that when we get another request for the same relative path, we can quickly find out that a local upstream can fulfill the request. Without that information, we would "walk" the list of upstreams on every request, which can be a costly operation.
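To illustrate why persisting the match matters, here is a minimal plain-Ruby sketch. It is not GitLab code: `LocalUpstream`, `walk_upstreams`, `resolve`, and the in-memory `cache` hash are hypothetical stand-ins for the real models and table, and the paths and ids are made up.

```ruby
# Hypothetical stand-in for a local upstream: maps relative paths to package file ids.
LocalUpstream = Struct.new(:id, :files) do
  def find_package_file_id(relative_path)
    files[relative_path]
  end
end

# Slow path: inspect every upstream in order until one has the file.
def walk_upstreams(upstreams, relative_path)
  upstreams.each do |upstream|
    package_file_id = upstream.find_package_file_id(relative_path)
    return [upstream.id, package_file_id] if package_file_id
  end
  nil
end

# Fast path: consult the persisted entry first; only walk (and record) on a miss.
def resolve(cache, upstreams, relative_path)
  cache[relative_path] ||= walk_upstreams(upstreams, relative_path)
end

upstreams = [
  LocalUpstream.new(1, { "com/acme/a/1.0/a-1.0.jar" => 101 }),
  LocalUpstream.new(2, { "com/acme/b/2.0/b-2.0.pom" => 202 })
]
cache = {}

resolve(cache, upstreams, "com/acme/b/2.0/b-2.0.pom") # walks the upstreams once
resolve(cache, [], "com/acme/b/2.0/b-2.0.pom")        # answered from the recorded entry
```

The second call succeeds even with an empty upstream list, which is the point: once the pair (upstream, package file) is recorded for a relative path, the walk is skipped.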

Thus, the challenge is how to store a PackageFile id in the existing cache entry table. Well, the short answer is: we can't. The existing table has many columns related to cache entries located on remote upstreams, so it is not a wise idea to try to store a package file id there.

Our solution is to introduce a new dedicated table with the right columns to record that a local upstream has a package file that can fulfill a given relative path.

On the subject of cache entry tables: they are the core of the virtual registry cache system. In #473144 (comment 2199015293), we decided to use partitioning for the existing (remote) cache entries. The partition key is the relative_path so that when we receive a request for a file (the most common request in virtual registries), we can use it to quickly locate the cache entry.

For local upstreams, users could be aggregating large amounts of package files under a single upstream. As such, we decided to use the exact same approach: partition the table, with relative_path as the partition key to speed up access by relative_path.
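The benefit of keying partitions on relative_path can be sketched in plain Ruby. This is an illustration only, under the assumption of hash-style partitioning: the partition count of 16 and the use of CRC32 are made up for the example (a database would use its own hash function), and `partition_for`, `insert_entry`, and `find_entry` are hypothetical helpers.

```ruby
require "zlib"

# Assumed for illustration; the real partition count is not stated here.
PARTITION_COUNT = 16

# Deterministically map a relative path to one partition.
# CRC32 is a stand-in for whatever hash the database applies to the key.
def partition_for(relative_path)
  Zlib.crc32(relative_path) % PARTITION_COUNT
end

# Each partition is modeled as its own small hash (relative_path => package_file_id).
def insert_entry(partitions, relative_path, package_file_id)
  partitions[partition_for(relative_path)][relative_path] = package_file_id
end

# A lookup by relative_path touches exactly one partition.
def find_entry(partitions, relative_path)
  partitions[partition_for(relative_path)][relative_path]
end

partitions = Array.new(PARTITION_COUNT) { {} }
insert_entry(partitions, "com/acme/a/1.0/a-1.0.jar", 101)
find_entry(partitions, "com/acme/a/1.0/a-1.0.jar") # only one partition is searched
```

Because the same path always maps to the same partition, both the insert and the search by relative_path are routed to a single, smaller table.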

At some point, it would be ideal to rename virtual_registries_packages_maven_cache_entries to virtual_registries_packages_maven_remote_cache_entries to be extra clear. This would bring changes to the models and API. As such, I don't want to handle it here.

🗒️ Potential main queries

We don't have the follow-up MRs that will actually use this new model yet, but the expected queries are as follows (selected from !174985 (merged)):

  1. Insert a new record to the table. Pretty straightforward thing.
  2. Search a record given a relative_path. Similar to this one (without the status column since we don't have it).

Destruction is mainly handled by cascading deletes and not directly by users. This means that a record in this table is only destroyed when the related (top-level) group, the upstream, or the package file is destroyed. Since we don't have anything particular to do here (contrary to remote upstreams, where we need to destroy a file on object storage), we're using the usual database cascading delete.
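The cascading behavior can be mimicked with a small in-memory sketch. It is only an illustration of the semantics: in the real table the database enforces this through foreign keys, and `InMemoryStore` with its methods is a hypothetical stand-in. Only the upstream and package file parents are modeled; the top-level group would behave the same way.

```ruby
# Hypothetical in-memory model of parents (upstreams, package files) and child entries.
class InMemoryStore
  attr_reader :upstreams, :package_files, :entries

  def initialize
    @upstreams = {}
    @package_files = {}
    @entries = []
  end

  # An entry can only reference parents that exist (like a foreign key).
  def add_entry(upstream_id:, package_file_id:, relative_path:)
    raise ArgumentError, "unknown upstream" unless upstreams.key?(upstream_id)
    raise ArgumentError, "unknown package file" unless package_files.key?(package_file_id)

    entries << { upstream_id: upstream_id, package_file_id: package_file_id, relative_path: relative_path }
  end

  # Deleting a parent removes every entry that references it (the cascade).
  def delete_upstream(id)
    upstreams.delete(id)
    entries.reject! { |entry| entry[:upstream_id] == id }
  end

  def delete_package_file(id)
    package_files.delete(id)
    entries.reject! { |entry| entry[:package_file_id] == id }
  end
end

store = InMemoryStore.new
store.upstreams[1] = "local upstream"
store.package_files[101] = "a-1.0.jar"
store.package_files[102] = "a-1.0.pom"
store.add_entry(upstream_id: 1, package_file_id: 101, relative_path: "com/acme/a/1.0/a-1.0.jar")
store.add_entry(upstream_id: 1, package_file_id: 102, relative_path: "com/acme/a/1.0/a-1.0.pom")

store.delete_package_file(101) # only the entry for package file 102 remains
store.delete_upstream(1)       # no entries remain
```

Nothing else has to happen on deletion, which matches the point above: there is no file on object storage to clean up for a local entry.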

🗒️ Implementation plan

Because this change goes quite deep into the existing logic, we're going to split it into multiple MRs:

  1. Upstream changes. !206725 (merged)
  2. Local::Cache::Entry model and database changes. 👈 This is this MR.
  3. Update the services layer logic.
  4. Manage the local target destruction logic, e.g. what happens when a project or group targeted by a local upstream is destroyed.
  5. Update the APIs logic. This is the client that manages upstreams (CRUD operations).
  6. Update the documentation.

🤔 What does this MR do and why?

  • Introduce ::VirtualRegistries::Packages::Maven::Local::Cache::Entry model and its table.
    • A few basic model validations are also introduced.
  • Add the related specs.

In the follow-up MRs, we will add the necessary scopes and additional utility functions to the model.

📚 References

🖥️ Screenshots or screen recordings

No UI changes

🧑‍🔬 How to set up and validate locally

There is not much you can do with the model alone since there is no business code that will interact with it (yet).

We can still play around in a Rails console.

top_level_group = Group.top_level.sample

upstream = ::VirtualRegistries::Packages::Maven::Upstream.create!(group: top_level_group, url: "https://gitlab.com/maven1", name: "testing local cache entries")

e = ::VirtualRegistries::Packages::Maven::Local::Cache::Entry.new(group: top_level_group, upstream: upstream)

e.valid?
=> false

e.errors.to_a
=> ["Package file must exist", "Relative path can't be blank"]

e.relative_path = "foo/bar"
e.package_file = Packages::PackageFile.last

e.valid?
=> true

🏁 MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez
