Optimize the upload confirmation endpoint in the virtual registries
🐺 Context
In the Maven Virtual registry feature, we have the GitLab instance that plays the role of pull-through cache between a Maven package manager client (such as $ mvn) and an upstream registry:
$ mvn <-> GitLab <-> External Maven registry (upstream)
There are multiple goals here but one of them is caching. The GitLab instance will look at the files going through and cache them (on object storage). When the exact same request is made, we can serve the file from the GitLab instance instead of the upstream registry.
Now, when a request to a file unknown to cache is made, we use a workhorse assisted upload to create the cache record. We're not going to enter the details of this logic but let's summarize it as two steps:
- Workhorse upload a file to a given location on object storage.
- Workhorse calls the rails backend to "confirm" the upload.
- This is when we create the model records associated with the upload. In our case, we will create an entry in the Virtual Registry cache.
The Virtual Registry needs to work under pretty strict execution time constraints. As such, in Maven Virtual Registry: optimize the upload con... (#499280 - closed), we looked at the upload confirmation endpoint to see if there was any improvement that could be made.
Looking from the object storage side, we noticed this:
There is a PutObject request. That's expected. This is the upload from step (1.). However, we also see a HeadObject. HEAD requests are usually used to get metadata around the file stored on object storage (like its size or if the accessed key exists) without download the contents of the file. This was a bit puzzling as workhorse is kind enough to give us several metadata fields (size, content type and even digests) in step 2. Why would we want to HEAD to file? It turned out that we have an existence check on the rails side.
That code is before my time here but it seems to be like a left over from the early versions of the workhorse assisted upload support. At the time of this writing, if we are on step (2.), then that means that workhorse has successfully uploaded the file. Moreover, the upload confirmation endpoint can't be accessed by outside. Its authentication logic will only allow workhorse to call it (by using a shared secret). Thus, if step (2.) is called, we are guaranteed that the file is on object storage: we don't need to check its existence.
Looking further in the rails logs, we also saw this:
TRANSACTION (0.1ms) BEGIN /*application:web,correlation_id:01JA849XP38Z9C9E2JGHX6X57D,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:68:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:68:in `block in create_or_update_by!'
VirtualRegistries::Packages::Maven::CachedResponse Exists? (0.2ms) SELECT 1 AS one FROM "virtual_registries_packages_maven_cached_responses" WHERE "virtual_registries_packages_maven_cached_responses"."relative_path" = '/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom' AND "virtual_registries_packages_maven_cached_responses"."upstream_id" = 26 AND "virtual_registries_packages_maven_cached_responses"."status" = 0 LIMIT 1 /*application:web,correlation_id:01JA849XP38Z9C9E2JGHX6X57D,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:68:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:68:in `block in create_or_update_by!'
VirtualRegistries::Packages::Maven::CachedResponse Create (0.3ms) INSERT INTO "virtual_registries_packages_maven_cached_responses" ("group_id", "upstream_id", "upstream_checked_at", "downloaded_at", "created_at", "updated_at", "size", "relative_path", "file", "object_storage_key", "upstream_etag", "content_type", "file_final_path") VALUES (22, 26, '2024-10-15 13:01:32.184432', '2024-10-15 13:01:32.184432', '2024-10-15 13:01:32.200679', '2024-10-15 13:01:32.200679', 2398, '/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom', 'spring-web-6.1.12.pom', '5f/9c/5f9c4ab08cac7457e9111a30e4664920607ea2c115a1433d7be98e97e64244ca/virtual_registries/packages/maven/26/upstream/26/cached_response/adfca8e75999d3ad0f1fcdff0de3960320c3600e349b3617f296bb0444d3800c', '"54ce07f4124259b2ea58548e9d620004"', 'text/xml', '5f/9c/5f9c4ab08cac7457e9111a30e4664920607ea2c115a1433d7be98e97e64244ca/@final/3d/99/23e08863197c92435f70a29a150c438a9f72b6b2b66296385b808e0a1c3d') RETURNING "id" /*application:web,correlation_id:01JA849XP38Z9C9E2JGHX6X57D,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:68:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:68:in `block in create_or_update_by!'
VirtualRegistries::Packages::Maven::CachedResponse Update (0.3ms) UPDATE "virtual_registries_packages_maven_cached_responses" SET "file_store" = 2 WHERE "virtual_registries_packages_maven_cached_responses"."id" = 10941 /*application:web,correlation_id:01JA849XP38Z9C9E2JGHX6X57D,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/concerns/file_store_mounter.rb:17:in `block in mount_file_store_uploader'*/
↳ app/models/concerns/file_store_mounter.rb:17:in `block in mount_file_store_uploader'
TRANSACTION (0.1ms) COMMIT /*application:web,correlation_id:01JA849XP38Z9C9E2JGHX6X57D,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/lib/gitlab/database.rb:417:in `commit'*/
Within the same transaction, we:
- Create the model that holds the object storage reference. This is expected. That's actually the end goal of the upload confirmation endpoint.
- Update that previously created record to update the
file_storecolumn.
In other words, we have an INSERT followed by an UPDATE database statement. That's sad. We should access the table for writing only once. The backend logic side should be arranged in a way that the INSERT statement is "complete" and doesn't require any additional UPDATE.
This MR will thus update the virtual registry confirmation endpoint so that:
- we don't
HEADthe object storage file. - we don't
UPDATEthe ActiveRecord model right afterINSERT.
The changes for (1.) are deep in the workhorse assisted upload logic. As such, they have been scoped to virtual registry uploaders only for now. We opened Generalize avoiding the remote file existence c... (#502373) to generalize this solution as the other workhorse assisted uploads could benefit of this change too (one object storage request saved).
For (2.), it is gated logically. The UPDATE is currently fired due to a ruby statement that completely avoids ActiveRecord logic and goes straight to the database (#update_column). I'm assuming here that this is done for performance reasons (no model callbacks are executed). However, the downside is that we don't have ActiveRecord logic to detect that this can be a useless UPDATE if the new values that we want to apply are already the values set in the model. Thus, we will implement something similar: if the file_store column has already the value that we want to update to, then we can stop and avoid the UPDATE. We consider this change safe for all workhorse assisted uploads.
🤔 What does this MR do and why?
Update the file upload confirmation logic:
- Do not check for the file on object storage for uploads from the Virtual Registry.
- Do not update the
file_storecolumn if it is not necessary. - Update the related specs.
📖 References
🏎 MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
🖥 Screenshots or screen recordings
See next section for screen shots.
⚙ How to set up and validate locally
We're going to set up a Maven virtual registry to pull through a public file. This will trigger an upload to the virtual registry cache and we will be able to inspect the upload confirmation endpoint behavior.
Here is the general scenario:
- If using GDK, make sure that object storage (minio) is running.
- Set up the Maven virtual registry:
Feature.enable(:virtual_registry_maven) group = Group.find(<root_group_id>) r = ::VirtualRegistries::Packages::Maven::Registry.create!(group: g) # note down the registry id u = ::VirtualRegistries::Packages::Maven::Upstream.create!(group: g, url: 'https://repo1.maven.org/maven2') VirtualRegistries::Packages::Maven::RegistryUpstream.create!(group: g, registry: r, upstream: u) - You will need a PAT to access group
g(apiscope).
1️⃣ On master
We're going to run a $ curl request to simulate a file request to the virtual registry but before that, we need to monitor some aspects of the backend.
Open the minio admin console (minio/gdk-minio) and browse Monitoring -> Trace. Click the Start Trace button. Here you will see all the requests received by the object storage backend.
Run $ tail -f log/developement.log to see the SQL queries triggered by requests.
Run the request:
curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom"
In the minio admin console:
-> s3.PutObject + s3.HeadObject requests
From the development.log file:
TRANSACTION (0.1ms) BEGIN /*application:web,correlation_id:01JBV5P2V8DJZ14W9CGW3W8983,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'
VirtualRegistries::Packages::Maven::CachedResponse Exists? (0.1ms) SELECT 1 AS one FROM "virtual_registries_packages_maven_cached_responses" WHERE "virtual_registries_packages_maven_cached_responses"."relative_path" = '/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom' AND "virtual_registries_packages_maven_cached_responses"."upstream_id" = 27 AND "virtual_registries_packages_maven_cached_responses"."status" = 0 LIMIT 1 /*application:web,correlation_id:01JBV5P2V8DJZ14W9CGW3W8983,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'
VirtualRegistries::Packages::Maven::CachedResponse Create (0.3ms) INSERT INTO "virtual_registries_packages_maven_cached_responses" ("group_id", "upstream_id", "upstream_checked_at", "downloaded_at", "created_at", "updated_at", "size", "relative_path", "file", "object_storage_key", "upstream_etag", "content_type", "file_final_path", "file_md5", "file_sha1") VALUES (22, 27, '2024-11-04 08:46:55.429857', '2024-11-04 08:46:55.429857', '2024-11-04 08:46:55.457172', '2024-11-04 08:46:55.457172', 2398, '/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom', 'spring-web-6.1.12.pom', '5f/9c/5f9c4ab08cac7457e9111a30e4664920607ea2c115a1433d7be98e97e64244ca/virtual_registries/packages/maven/26/upstream/27/cached_response/adfca8e75999d3ad0f1fcdff0de3960320c3600e349b3617f296bb0444d3800c', '"54ce07f4124259b2ea58548e9d620004"', 'text/xml', '5f/9c/5f9c4ab08cac7457e9111a30e4664920607ea2c115a1433d7be98e97e64244ca/@final/33/0f/1005845f1be9369bf023f764554207238e8747896ed433796cd089b6b92d', '\x3534636530376634313234323539623265613538353438653964363230303034', '\x62626465376339666236643734663961323339336262333662306434616337653732633232376565') RETURNING "id" /*application:web,correlation_id:01JBV5P2V8DJZ14W9CGW3W8983,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'
VirtualRegistries::Packages::Maven::CachedResponse Update (0.1ms) UPDATE "virtual_registries_packages_maven_cached_responses" SET "file_store" = 2 WHERE "virtual_registries_packages_maven_cached_responses"."id" = 10975 /*application:web,correlation_id:01JBV5P2V8DJZ14W9CGW3W8983,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/concerns/file_store_mounter.rb:17:in `block in mount_file_store_uploader'*/
↳ app/models/concerns/file_store_mounter.rb:17:in `block in mount_file_store_uploader'
TRANSACTION (0.1ms) COMMIT /*application:web,correlation_id:01JBV5P2V8DJZ14W9CGW3W8983,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/lib/gitlab/database.rb:418:in `commit'*/
↳ lib/gitlab/database.rb:418:in `commit'
One INSERT + One UPDATE in the same TRANSACTION
2️⃣ With this MR
If you're running the scenario again, please make sure to reset the cache of the virtual registry. If your request is served from the cache, there will be no upload = no upload confirmation at all.
In a rails console:
u.cached_responses.destroy_all
Let's run the $ curl request again:
curl --header "Private-Token: <PAT>" "http://gdk.test:8000/api/v4/virtual_registries/packages/maven/<r.id>/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom"
Check minio admin trace log:
Only one s3.PutObject request!
In the development.log file:
TRANSACTION (0.1ms) BEGIN /*application:web,correlation_id:01JBV51FYCN8H91DD51R6YV9KE,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'
VirtualRegistries::Packages::Maven::CachedResponse Exists? (0.1ms) SELECT 1 AS one FROM "virtual_registries_packages_maven_cached_responses" WHERE "virtual_registries_packages_maven_cached_responses"."relative_path" = '/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom' AND "virtual_registries_packages_maven_cached_responses"."upstream_id" = 27 AND "virtual_registries_packages_maven_cached_responses"."status" = 0 LIMIT 1 /*application:web,correlation_id:01JBV51FYCN8H91DD51R6YV9KE,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'
VirtualRegistries::Packages::Maven::CachedResponse Create (0.2ms) INSERT INTO "virtual_registries_packages_maven_cached_responses" ("group_id", "upstream_id", "upstream_checked_at", "downloaded_at", "created_at", "updated_at", "file_store", "size", "relative_path", "file", "object_storage_key", "upstream_etag", "content_type", "file_final_path", "file_md5", "file_sha1") VALUES (22, 27, '2024-11-04 08:35:40.441760', '2024-11-04 08:35:40.441760', '2024-11-04 08:35:40.452059', '2024-11-04 08:35:40.452059', 2, 2398, '/org/springframework/spring-web/6.1.12/spring-web-6.1.12.pom', 'spring-web-6.1.12.pom', '5f/9c/5f9c4ab08cac7457e9111a30e4664920607ea2c115a1433d7be98e97e64244ca/virtual_registries/packages/maven/26/upstream/27/cached_response/adfca8e75999d3ad0f1fcdff0de3960320c3600e349b3617f296bb0444d3800c', '"54ce07f4124259b2ea58548e9d620004"', 'text/xml', '5f/9c/5f9c4ab08cac7457e9111a30e4664920607ea2c115a1433d7be98e97e64244ca/@final/4b/3b/c0f8365e1c4a2cc6dbc426fcbbce926c57b5008aa13c6c4d8f8827bfbb56', '\x3534636530376634313234323539623265613538353438653964363230303034', '\x62626465376339666236643734663961323339336262333662306434616337653732633232376565') RETURNING "id" /*application:web,correlation_id:01JBV51FYCN8H91DD51R6YV9KE,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'*/
↳ app/models/virtual_registries/packages/maven/cached_response.rb:66:in `block in create_or_update_by!'
TRANSACTION (0.1ms) COMMIT /*application:web,correlation_id:01JBV51FYCN8H91DD51R6YV9KE,endpoint_id:POST /api/:version/virtual_registries/packages/maven/:idpath/upload,db_config_database:gitlabhq_development,db_config_name:main,line:/lib/gitlab/database.rb:418:in `commit'*/
↳ lib/gitlab/database.rb:418:in `commit'
There is only one INSERT
The changes are behaving the way we want.


