Maven virtual registries local upstream support: Upstream model changes
🧦 Context
In the Maven virtual registry world, we pull files from an upstream through the GitLab virtual registry. While doing so, we also cache the requested file so that subsequent requests are served solely by GitLab and not by the upstream.
The data model is as follows:
Registry <-n:1- RegistryUpstream -1:n-> Upstream <-n:1- CacheEntry
Basically, a Registry can have multiple Upstreams (through a join model RegistryUpstream) and an Upstream has many cache entries.
Up to now, an Upstream was a URL with optional credentials, which defined how to access a remote upstream.
With Maven virtual registries: local upstreams (#548558), we want to introduce the concept of a local upstream. Instead of looking for files in remote upstreams, we look at the GitLab Maven package registry. In short, we point to a local project or local group and we inspect the (Maven) packages available in that project or group. To handle this, we need to update the existing logic to inspect local upstreams. Before that, we need to define what we need for a local upstream.
⚔️ Design choices
So, we have the existing model VirtualRegistries::Packages::Maven::Upstream that has url, username and password attributes. This is a remote upstream.
A local upstream is defined by pointing to a local project or local group. local here means the same GitLab instance where the registry lives.
So, how are we going to store that an upstream is pointing to a Project or Group? The main problem is that the upstreams of a Registry form an ordered list. As such, we have a position attribute in the RegistryUpstream model. When locating which upstream has a requested file, it is critical to walk the list of upstreams in order. That order represents the priority too. Given upstream1, upstream2 and upstream3, if a requested file exists in all 3 upstreams, virtual registries will always serve it from upstream1. If we use two different models, it becomes more challenging to get the overall list of upstreams (mixed kinds) from the registry object. Also, the existing upstream model has extra fields that we're going to use for local upstreams too. For example, we have a name attribute and we have a fuzzy search feature on that attribute.
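The priority rule above can be sketched in plain Ruby. This is only an illustration: the Struct definitions and the file? and locate_upstream names are hypothetical stand-ins, not the actual models or service code.

```ruby
# Hypothetical stand-ins for the real models; names are illustrative only.
RegistryUpstream = Struct.new(:position, :upstream)
Upstream = Struct.new(:name, :files) do
  # Assumption: an upstream can tell whether it holds a given path.
  def file?(path)
    files.include?(path)
  end
end

# Walk the upstreams by ascending position and return the first one
# that has the requested file, or nil when none has it.
def locate_upstream(registry_upstreams, path)
  registry_upstreams
    .sort_by(&:position)
    .map(&:upstream)
    .find { |upstream| upstream.file?(path) }
end

pom = 'com/acme/lib/1.0/lib-1.0.pom'
jar = 'org/foo/bar/2.0/bar-2.0.jar'

upstream1 = Upstream.new('upstream1', [pom])
upstream2 = Upstream.new('upstream2', [pom, jar])
upstream3 = Upstream.new('upstream3', [pom])

# Deliberately unsorted: only the position attribute defines the priority.
registry_upstreams = [
  RegistryUpstream.new(3, upstream3),
  RegistryUpstream.new(1, upstream1),
  RegistryUpstream.new(2, upstream2)
]

puts locate_upstream(registry_upstreams, pom).name # prints "upstream1"
```

With a single upstream model, this walk does not need to know whether an entry is remote or local. Only the way each upstream answers the "do you have this file?" question differs.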
Thus, we would be better served by re-using the same class. Single table inheritance? No, that's not recommended. On top of that, we don't want to add columns to the existing table that would be used in only one of the two modes (remote or local). Ideally, we want to use the existing table and columns.
In #566217 (comment 2719335803), Moaz suggested using the url field. From the discussion, we concluded that we could store the project or group global id in that field since global ids are still URIs. What about a foreign key? We can't have one. The main reason is that destructive operations (deleting a group or a project) should delete all the linked local upstreams. However, deleting an upstream is not as straightforward as deleting a database record. We have additional things to do: update the positions in the registry upstreams of the impacted registries. See the API endpoint implementation that deletes a single upstream. Thus, we need to handle the foreign key logic on the Rails side because we need to update the position column of the related registry upstreams.
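To illustrate why a global id fits in the url field, here is a minimal sketch that uses only Ruby's standard URI library. The classify_upstream helper and the LOCAL_TARGETS constant are hypothetical, and the gid://gitlab/Project/42 shape is assumed from the usual global id format. The actual model may well rely on the GlobalID gem and dedicated validations instead.

```ruby
require 'uri'

# Hypothetical helper: classify an upstream from the value of its url field.
# Assumption: local upstreams store a global id such as gid://gitlab/Project/42.
LOCAL_TARGETS = %w[Project Group].freeze

def classify_upstream(url)
  uri = URI.parse(url)

  if uri.scheme == 'gid'
    # For gid://gitlab/Project/42 the path is "/Project/42".
    _, model_name, id = uri.path.split('/')
    raise ArgumentError, "unsupported target: #{model_name}" unless LOCAL_TARGETS.include?(model_name)

    { mode: :local, target_type: model_name, target_id: Integer(id) }
  elsif %w[http https].include?(uri.scheme)
    { mode: :remote, url: url }
  else
    raise ArgumentError, "unsupported url: #{url}"
  end
end

p classify_upstream('gid://gitlab/Group/24')
p classify_upstream('https://repo1.maven.org/maven2')
```

The same column can therefore carry both kinds of upstream, and the URI scheme is enough to tell them apart.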
I thought about:
1. Use LFK (loose foreign keys) with a dedicated background worker.
2. Introduce a delete trigger on the upstreams table that will update the registry upstreams table for us.
3. Use the existing delete project or delete group event from the Event store and have a listener on that. When we receive an event, we delete the related records from the registry upstreams.
(1.) is a bit more complex than what we need so it's between (2.) and (3.).
(3.) is the simplest implementation path. Also with (2.), I'm a bit concerned that we could
(3.) has a disadvantage: it relies on a background job, and background jobs can be dropped, so it's not guaranteed that the event store callback is executed. I think this is still reasonable. These leftovers should not impact the client logic (when trying to get packages out of deleted projects or groups, we will simply find nothing). We could also have another safety net in cleanup policies. We are already thinking about a general cleanup for orphan upstreams (upstreams not linked to any registry). We could also think about checking for targetless local upstreams.
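Whichever option is retained, the cleanup has to keep the positions consistent. Here is a rough in-memory sketch of that bookkeeping. The RegistryUpstreamRow struct and the remove_upstreams helper are hypothetical, and contiguous positions starting at 1 are an assumption made for the example.

```ruby
# Hypothetical in-memory stand-in for a registry upstream row.
RegistryUpstreamRow = Struct.new(:registry_id, :upstream_id, :position)

# Drop every join row pointing at the given upstream ids, then renumber
# the remaining rows of each registry so positions stay contiguous.
# Assumption: positions start at 1 within each registry.
def remove_upstreams(rows, upstream_ids)
  remaining = rows.reject { |row| upstream_ids.include?(row.upstream_id) }

  remaining.group_by(&:registry_id).each_value do |registry_rows|
    registry_rows.sort_by(&:position).each_with_index do |row, index|
      row.position = index + 1
    end
  end

  remaining
end

rows = [
  RegistryUpstreamRow.new(1, 10, 1),
  RegistryUpstreamRow.new(1, 11, 2),
  RegistryUpstreamRow.new(1, 12, 3),
  RegistryUpstreamRow.new(2, 11, 1),
  RegistryUpstreamRow.new(2, 12, 2)
]

# Upstream 11 targeted a project that was just deleted.
remove_upstreams(rows, [11]).each { |row| p row.to_a }
```

In registry 1, upstream 12 moves from position 3 to 2. In registry 2, it moves from position 2 to 1. A plain cascading delete at the database level would leave those gaps in place.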
🗒️ Implementation plan
This change being quite deep in the existing logic, we're going to split it in multiple MRs:
- Upstream changes. 👈 This is this MR.
- Local::Cache::Entry model and database changes. !207117 (merged)
- Update the services layer logic.
- Manage the local target destruction logic. eg. what happens when a project or group targeted by a local upstream is destroyed.
- Update the APIs logic. This is the client that manages upstreams (CRUD operations).
- Update the documentation.
🤔 What does this MR do and why?
- Update the VirtualRegistries::Packages::Maven::Upstream model to handle project and group global ids in the url field.
  - Adjust the validations depending on whether the upstream is local or remote.
- Update the RegistryUpstream to have an additional validation for local upstreams (no nested ancestors).
- Update the related specs.
📚 References
- Maven virtual registries: local upstreams backe... (#566217) • David Fernandez • 18.7 • At risk.
- Maven virtual registries: local upstreams (#548558) • Bonnie Tsang • 18.7 • On track.
- Maven Virtual Registry - Road to General Availa... (&15089) • Tim Rizzi, Crystal Poole+ • 18.6 • On track.
🖥️ Screenshots or screen recordings
No UI changes
🧑🔬 How to set up and validate locally
Requirements:
- Have a GitLab instance with an EE license as the maven virtual registry is an EE only feature.
- Have a top level group id ready (maintainer access level).
- Have a PAT ready (scope api).
First, let's enable the feature flag: Feature.enable(:maven_virtual_registry)
Second, let's create a Maven virtual registry and an upstream that points to Maven Central. We can use curl for that.
# create the registry object and note the id
$ curl -X POST -H "PRIVATE-TOKEN: <PAT>" "http://gdk.test:8000/api/v4/groups/<top level group id>/-/virtual_registries/packages/maven/registries?name=testing_counters"
# create the upstream and note the id
$ curl -H "PRIVATE-TOKEN: <PAT>" --data-urlencode 'url=https://repo1.maven.org/maven2' --data-urlencode 'name=upstream' -X POST http://gdk.test:8000/api/v4/virtual_registries/packages/maven/registries/<registry id>/upstreams
Since the new columns are not exposed in the APIs yet, we have to use a rails console to inspect these:
::VirtualRegistries::Packages::Maven::Upstream.last
=> #<VirtualRegistries::Packages::Maven::Upstream:0x0000000325f56888
id: 269,
group_id: 24,
...
local_project_id: nil,
local_group_id: nil,
mode: 0>
local_project_id and local_group_id are nil. mode is 0 (remote).
🏁 MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.