Maven virtual registries local upstream support: Upstream model changes

🧦 Context

In the Maven virtual registry world, we pull files from an upstream through the GitLab virtual registry. While doing so, we also cache the requested file so that subsequent requests are served solely by GitLab and not by the upstream.

The data model is as follows:

Registry <-n:1- RegistryUpstream -1:n-> Upstream <-n:1- CacheEntry

Basically, a Registry can have multiple Upstreams (through a join model RegistryUpstream) and an Upstream has many cache entries.

Up to now, an Upstream was a URL with optional credentials, defining how to access a remote upstream.

With Maven virtual registries: local upstreams (#548558), we want to introduce the concept of a local upstream. Instead of looking for files in remote upstreams, we look at the GitLab Maven package registry. In short, we point to a local project or local group and inspect the (Maven) packages available at that project or group. To handle this, we need to update the existing logic to inspect local upstreams. Before that, we need to define what we need for a local upstream.

⚔️ Design choices

So, we have the existing model VirtualRegistries::Packages::Maven::Upstream that has url, username and password attributes. This is a remote upstream.

A local upstream is defined by pointing to a local project or local group. local here means the same GitLab instance where the registry lives.

So, how are we going to store that an upstream points to a Project or a Group? The main problem is that the list of upstreams on a Registry is an ordered list: the RegistryUpstream join model has a position attribute for this. When locating which upstream has a requested file, it is critical to walk the list of upstreams in order, because that order also represents the priority. Given upstream1, upstream2 and upstream3, if a requested file exists in all three upstreams, virtual registries will always serve it from upstream1. If we used two different models, getting the overall ordered list of upstreams (mixed kinds) from the registry object would suddenly become much more challenging. Also, the existing upstream model has extra fields that we're going to reuse for local upstreams. For example, we have a name attribute, and we have a fuzzy search feature on that attribute.
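The priority walk described above can be sketched in plain Ruby. This is a simulation with Structs, not the actual GitLab classes; the names (serving_upstream, files) are illustrative:

```ruby
# Simulated models (illustrative, not the actual GitLab classes).
Upstream = Struct.new(:name, :files)
RegistryUpstream = Struct.new(:upstream, :position)

upstream1 = Upstream.new('upstream1', ['a.jar'])
upstream2 = Upstream.new('upstream2', ['a.jar', 'b.jar'])
upstream3 = Upstream.new('upstream3', ['b.jar'])

# The join rows may be stored in any order; position defines the priority.
registry_upstreams = [
  RegistryUpstream.new(upstream2, 2),
  RegistryUpstream.new(upstream3, 3),
  RegistryUpstream.new(upstream1, 1)
]

# Walk the upstreams by ascending position and serve from the first match.
def serving_upstream(registry_upstreams, file)
  registry_upstreams
    .sort_by(&:position)
    .map(&:upstream)
    .find { |upstream| upstream.files.include?(file) }
end

serving_upstream(registry_upstreams, 'a.jar').name # => "upstream1"
serving_upstream(registry_upstreams, 'b.jar').name # => "upstream2"
```

Even though a.jar exists in both upstream1 and upstream2, it is always served from upstream1 because of its lower position.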

Thus, we would be better served by reusing the same class. Single table inheritance? Nope, that's not recommended. On top of that, we don't want to add columns to the existing table that would be unused in one of the modes (remote or local). Ideally, we want to use the existing table and columns.

In #566217 (comment 2719335803), Moaz suggested using the url field. From the discussion, we concluded that we could store the project or group global ID in that field, since global IDs are still URIs. What about a foreign key? We can't have one. The main reason is that destructive operations (deleting a group or a project) should delete all the linked local upstreams, and deleting an upstream is not as straightforward as deleting a database record: we also need to update the positions in the registry upstreams of the impacted registries. See the API endpoint implementation that deletes a single upstream. Thus, the foreign key semantics need to be handled on the Rails side, because we need to update the related registry upstream position columns.
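Since global IDs are URIs (for example gid://gitlab/Project/123), the same url column can hold both kinds of targets. A minimal sketch of telling them apart, using only Ruby's stdlib URI (the actual model would presumably rely on the GlobalID gem; the method names here are illustrative):

```ruby
require 'uri'

# A remote upstream stores an http(s) URL; a local upstream stores a
# GitLab global ID, which is also a URI (gid://gitlab/Project/123).
def upstream_mode(url)
  case URI.parse(url).scheme
  when 'http', 'https' then :remote
  when 'gid'           then :local
  else :invalid
  end
end

# For a local upstream, the target class and id can be read from the gid path.
def local_target(url)
  uri = URI.parse(url)
  return unless uri.scheme == 'gid'

  klass, id = uri.path.delete_prefix('/').split('/')
  { class: klass, id: Integer(id) }
end

upstream_mode('https://repo1.maven.org/maven2') # => :remote
upstream_mode('gid://gitlab/Project/123')       # => :local
local_target('gid://gitlab/Group/24')           # => {:class=>"Group", :id=>24}
```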

I thought about:

  1. Use LFK (loose foreign keys) with a dedicated background worker.
  2. Introduce a delete trigger on the upstreams table that updates the registry upstreams table for us.
  3. Use the existing delete project or delete group events from the event store and listen for them. When we receive an event, we delete the related records from the registry upstreams table.

(1.) is a bit more complex than what we need, so it's between (2.) and (3.).

(3.) is the simplest implementation path. Also, with (2.), I'm a bit concerned that we could 🌊 flood the registry upstream table with deletes if a large batch of projects or groups is deleted. With (3.), we have tools to deal with volume: https://docs.gitlab.com/development/sidekiq/worker_attributes/#concurrency-limit.

(3.) has a disadvantage: it relies on a background job, and background jobs can be dropped, so it's not guaranteed that the event store callback is executed. I think this is still reasonable. These leftovers should not impact the client logic: when trying to get packages out of deleted projects or groups, we will simply find nothing. We could also add another safety net in cleanup policies. We are already thinking about a general cleanup for orphan upstreams (upstreams not linked to any registry). We could also consider checking for targetless local upstreams (local upstreams whose project or group no longer exists).
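The core of option (3.) is the cleanup itself: on a project or group deletion event, remove the matching local upstreams and close the position gaps left in the impacted registries. A plain-Ruby simulation of that repositioning logic (the real implementation would live in an event store subscriber and run database updates; the function name is illustrative):

```ruby
# Simulated registry upstream rows as [upstream_id, position] pairs.
# When an upstream is deleted, the remaining positions are compacted so
# the priority list stays contiguous (1, 2, 3, ...), preserving the
# relative order of the survivors.
def delete_upstream_and_compact(rows, deleted_upstream_id)
  rows
    .reject { |upstream_id, _| upstream_id == deleted_upstream_id }
    .sort_by { |_, position| position }
    .each_with_index
    .map { |(upstream_id, _), index| [upstream_id, index + 1] }
end

rows = [[10, 1], [11, 2], [12, 3]]
delete_upstream_and_compact(rows, 11) # => [[10, 1], [12, 2]]
```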

🗒️ Implementation plan

This change being quite deep in the existing logic, we're going to split it into multiple MRs:

  1. Upstream changes. 👈 This is this MR.
  2. Local::Cache::Entry model and database changes. !207117 (merged)
  3. Update the services layer logic.
  4. Manage the local target destruction logic, e.g. what happens when a project or group targeted by a local upstream is destroyed.
  5. Update the APIs logic. This is the client that manages upstreams (CRUD operations).
  6. Update the documentation.

🤔 What does this MR do and why?

  • Update the VirtualRegistries::Packages::Maven::Upstream model to handle project and group global IDs in the url field.
    • Adjust the validations depending on whether the upstream is local or remote.
  • Update the RegistryUpstream model with an additional validation for local upstreams (no nested ancestors).
  • Update the related specs.
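To make the mode-dependent validations concrete, here is a rough plain-Ruby sketch of what they could check. This is illustrative only: the real model uses ActiveModel validations, and the exact rules (such as rejecting credentials on local upstreams) are assumptions, not the confirmed implementation:

```ruby
require 'uri'

# Illustrative only: the real model uses ActiveModel validations.
def upstream_errors(url:, username: nil, password: nil)
  errors = []

  case URI.parse(url).scheme
  when 'http', 'https'
    # Remote upstream: credentials are optional but must come as a pair.
    errors << 'username and password must be set together' if username.nil? ^ password.nil?
  when 'gid'
    # Local upstream: must point to a Project or a Group, no credentials.
    target_class = URI.parse(url).path.delete_prefix('/').split('/').first
    errors << 'gid must target a Project or Group' unless %w[Project Group].include?(target_class)
    errors << 'credentials are not allowed for local upstreams' if username || password
  else
    errors << 'url must be an http(s) URL or a gid URI'
  end

  errors
end

upstream_errors(url: 'https://repo1.maven.org/maven2')
# => []
upstream_errors(url: 'gid://gitlab/Project/1', username: 'u')
# => ["credentials are not allowed for local upstreams"]
```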

📚 References

🖥️ Screenshots or screen recordings

No UI changes

🧑‍🔬 How to set up and validate locally

Requirements:

  • Have a GitLab instance with an EE license, as the Maven virtual registry is an EE-only feature.
  • Have a top level group id ready (maintainer access level).
  • Have a PAT ready (scope api).

First, let's enable the feature flag: Feature.enable(:maven_virtual_registry)

Second, let's create a Maven virtual registry and an upstream that points to Maven Central. We can use curl for that.

# create the registry object and note the id
$ curl -X POST -H "PRIVATE-TOKEN: <PAT>" "http://gdk.test:8000/api/v4/groups/<top level group id>/-/virtual_registries/packages/maven/registries?name=testing_counters"

# create the upstream and note the id
$ curl -H "PRIVATE-TOKEN: <PAT>" --data-urlencode 'url=https://repo1.maven.org/maven2' --data-urlencode 'name=upstream' -X POST http://gdk.test:8000/api/v4/virtual_registries/packages/maven/registries/<registry id>/upstreams

Since the new columns are not exposed in the APIs yet, we have to use a rails console to inspect these:

::VirtualRegistries::Packages::Maven::Upstream.last
=> #<VirtualRegistries::Packages::Maven::Upstream:0x0000000325f56888
 id: 269,
 group_id: 24,
 ...
 local_project_id: nil,
 local_group_id: nil,
 mode: 0>

local_project_id and local_group_id are nil. mode is 0 (remote).

🏁 MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez
