
Refine the etag verification service

🌻 Context

We're working on the very first version of the dependency proxy for packages. See #407460 (comment 1373731852) for all the details from the technical investigation.

At its core, the concept is quite simple: GitLab acts as a proxy. Users pull packages through it, and GitLab serves as a pull-through cache.

Package Manager clients (npm, maven, ...) <-> GitLab <-> External Package Registry

Because GitLab sits in the middle of the package transport (as the proxy), we can use the GitLab package registry as a cache. In other words, before contacting the external package registry, we check whether the package is already in the local project registry. If it is, we return it directly.
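As a rough illustration of the pull-through idea, here is a minimal in-memory sketch. `PullThroughCache` and the `upstream` callable are hypothetical stand-ins for the project package registry and the external package registry; this is not the actual GitLab implementation.

```ruby
# Hypothetical stand-in for the dependency proxy: a Hash plays the role of the
# project package registry (the cache), and `upstream` is any callable that
# returns the file content for a path, or nil when the file does not exist.
class PullThroughCache
  def initialize(upstream)
    @upstream = upstream
    @cache = {}
  end

  # Returns [content, source], where source is :cache, :upstream or :not_found.
  def pull(path)
    return [@cache[path], :cache] if @cache.key?(path)

    content = @upstream.call(path)
    return [nil, :not_found] if content.nil?

    @cache[path] = content # "publish" the file into the local registry
    [content, :upstream]
  end
end
```

The first pull of a path goes to the upstream callable; later pulls of the same path are answered from the Hash without contacting it again.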

During our verifications on staging, we noticed that this part was not functioning properly. Before going into the details of the fix, let's see how the dependency proxy cache works.

Dependency proxy cache

The dependency proxy does not simply receive a request and check whether the requested file is present in the cache. It also proactively checks whether that file has changed in the external package registry.

You see, some registries allow users to overwrite packages, or users could remove a version and re-publish it with different files. The result is the same: a file with the same name in the external package registry could have changed.

We want to support this automatically: detect the situation, clear the cache entry for that file, and re-download the file from the external package registry to re-publish it into the dependency proxy.

This is nice, but how do we detect this situation? We reuse what is already implemented in the dependency proxy for container images: we send a HEAD request for the file to the external package registry and read the ETag field. Usually, the ETag is some kind of digest of the file, so we can compare that field with the digests of our cache entry.
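The HEAD-and-compare idea can be sketched as follows. This is a minimal illustration using Ruby's standard `Net::HTTP`, not the code of the service: the method names `head_etag`, `normalize_etag` and `etag_matches_exactly?` are made up for this example, and authentication, redirects and error handling are left out.

```ruby
require 'net/http'
require 'uri'

# Sends a HEAD request and returns the raw ETag header, or nil if the
# registry did not send one. Illustrative only: no auth, no redirects.
def head_etag(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  response['etag']
end

# Strips the optional weak-validator prefix (W/) and the surrounding quotes.
def normalize_etag(etag)
  etag.to_s.strip.sub(/\AW\//, '').delete_prefix('"').delete_suffix('"')
end

# True when the ETag is exactly one of the digests of the cached file.
def etag_matches_exactly?(etag, digests)
  digests.compact.include?(normalize_etag(etag))
end
```

With a strict comparison like this one, an ETag that wraps the digest in extra characters, or a response without any ETag, never matches, so the cache entry would be treated as outdated.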

This works well unless the remote registry does something unexpected with the ETag field. And yes, such registries exist. See #423033 (closed) for what we found out. Some registries:

  1. use an ETag field that embeds the sha1 digest instead of being the bare digest.
  2. will not send any ETag field at all on HEAD requests 😿.

This MR handles both situations as follows:

  1. Check if the known digests are included in the ETag field.
  2. There is nothing we can do here, so we handle this situation the same way as a network hiccup with the external package registry: if we have the requested package file, we return it. We could call this the "simple & dumb" mode, because either we have the file in the cache or we don't.
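The two rules above can be sketched as a small decision function. This is an assumption-laden illustration, not the code of `verify_package_file_etag_service.rb`: the names `check_etag` and `serve_cached_file?` and the three symbols are hypothetical, and the wrapped ETag format in the usage lines is only an example of an ETag that contains a digest.

```ruby
# Classifies a cached file against the ETag returned by the HEAD request:
#   :unverifiable - no ETag was sent, so nothing can be compared
#   :fresh        - one of the known digests appears inside the ETag
#   :stale        - an ETag was sent but none of the known digests is in it
def check_etag(etag, known_digests)
  return :unverifiable if etag.nil? || etag.strip.empty?

  # Drop nil/empty digests: String#include?('') is always true.
  digests = known_digests.compact.reject(&:empty?)
  digests.any? { |digest| etag.include?(digest) } ? :fresh : :stale
end

# "Simple & dumb" mode: without an ETag, a cached file is returned as-is,
# exactly as if the external registry could not be reached.
def serve_cached_file?(etag, known_digests)
  check_etag(etag, known_digests) != :stale
end

sha1 = 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
check_etag(%("{SHA1{#{sha1}}}"), [sha1]) # => :fresh
check_etag(nil, [sha1])                  # => :unverifiable
check_etag('"something-else"', [sha1])   # => :stale
```

Using an inclusion check instead of strict equality covers both a bare digest and a digest wrapped in registry-specific characters.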

🔍 What does this MR do and why?

  • Update ee/app/services/dependency_proxy/packages/maven/verify_package_file_etag_service.rb to support the situations where the ETag field from the HEAD response:
    • is missing.
    • contains a digest.
  • Update the related specs.

The entire dependency proxy for Maven is behind a feature flag. See #415218 (closed). Hence, we don't have any changelog here.

🖼 Screenshots or screen recordings

The dependency proxy is meant to be used with Maven clients which are CLI tools. So, uh, no 🌈 UI.

🔧 How to set up and validate locally

  1. Have a project ready with a personal access token.
  2. Enable the dependency proxy for maven:
    Feature.enable(:packages_dependency_proxy_maven)

1️⃣ ETag missing

For this case, we're going to use https://github.com/10io/fruits.

Even though that repository is public, you need a token to pull files out of it. Create a personal access token (classic).

  1. Set up the dependency proxy on the project to target that repository:
    Project.find(<local_project_id>).create_dependency_proxy_packages_setting!(maven_external_registry_url: 'https://maven.pkg.github.com/10io/fruits', maven_external_registry_username: '<your github username>', maven_external_registry_password: '<your github access token>', enabled: true)
  2. Disable background workers (so that packages can't be destroyed):
    $ gdk stop rails-background-jobs
  3. Check that the project has no packages:
    Project.find(<project_id>).packages
    => []
  4. Tail the workhorse logs: $ gdk tail workhorse.
  5. Let's pull the pom.xml file of one of the packages:
    $ curl "http://<gitlab user>:<gitlab pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom"
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
      ...
  6. Check the packages:
    Project.find(<project_id>).packages
    => [#<Packages::Package:0x00000001699b7b10 id: 330
    
    Project.find(232).packages.first.package_files
    => [#<Packages::PackageFile:0x0000000169f073b8
  7. Check the workhorse logs:
    2023-11-02_16:19:49.03199 gitlab-workhorse      : {"client_mode":"local_tempfile","copied_bytes":1237,"correlation_id":"01HE8DA0WV18BXQPM4F7BTJKSG","filename":"upload","is_local":true,"is_multipart":false,"is_remote":false,"level":"info","local_temp_path":"/Users/david/projects/gitlab-development-kit/gitlab/shared/packages/tmp/uploads","msg":"saved file","remote_id":"","time":"2023-11-02T17:19:49+01:00"}
    2023-11-02_16:19:49.17193 gitlab-workhorse      : {"content_type":"application/octet-stream","correlation_id":"01HE8DA0WV18BXQPM4F7BTJKSG","duration_ms":28056,"host":"gdk.test:8000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"172.16.123.1:52401","remote_ip":"172.16.123.1","route":"^/api/","status":200,"system":"http","time":"2023-11-02T17:19:49+01:00","ttfb_ms":27916,"uri":"/api/v4/projects/232/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom","user_agent":"curl/8.1.2","written_bytes":1237}
    (the file was pulled from the remote registry and published into the package registry)
  8. Let's pull the file again:
    $ curl "http://<gitlab user>:<gitlab pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom"
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
      ...
  9. We can see that the package was not destroyed and we didn't publish a new one:
    Project.find(232).packages.count
    => 1
  10. Check the workhorse logs:
    2023-11-02_16:21:40.51404 gitlab-workhorse      : {"correlation_id":"01HE8DE8C1W9GMP5PRGTXKGXHA","file":"/Users/david/projects/gitlab-development-kit/gitlab/shared/packages/83/5d/835d5e8314340ab852a2f979ab4cd53e994dbe38366afb6eed84fe4957b980c8/packages/331/files/551/ananas-1.3.5.pom","level":"info","method":"GET","msg":"Send file","time":"2023-11-02T17:21:40+01:00","uri":"/api/v4/projects/232/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom"}
    2023-11-02_16:21:40.51454 gitlab-workhorse      : {"content_type":"application/octet-stream","correlation_id":"01HE8DE8C1W9GMP5PRGTXKGXHA","duration_ms":672,"host":"gdk.test:8000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"172.16.123.1:52410","remote_ip":"172.16.123.1","route":"^/api/","status":200,"system":"http","time":"2023-11-02T17:21:40+01:00","ttfb_ms":672,"uri":"/api/v4/projects/232/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom","user_agent":"curl/8.1.2","written_bytes":1237}
    (The local file was used and sent. See that we used the 'Send file' logic.)
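To tell the two paths apart without reading the raw logs, the `msg` field of each workhorse line can be inspected. The helper below is a hypothetical convenience for local validation, assuming the line format shown above (a prefix followed by one JSON object per line) and the two messages seen in these logs: "saved file" on the first pull and "Send file" on the second.

```ruby
require 'json'

# Extracts the "msg" value from each workhorse log line. Lines without a
# JSON payload, or with a payload that does not parse, are skipped.
def workhorse_messages(log_text)
  log_text.each_line.filter_map do |line|
    json_start = line.index('{')
    next unless json_start

    JSON.parse(line[json_start..])['msg']
  rescue JSON::ParserError
    nil
  end
end

# :cache when the local file was sent, :upstream when a file was saved
# after being pulled from the remote registry, :unknown otherwise.
def served_from(log_text)
  messages = workhorse_messages(log_text)
  return :cache if messages.include?('Send file')
  return :upstream if messages.include?('saved file')

  :unknown
end
```

Pass it the log lines captured for a single request (for example, the lines sharing one correlation_id); mixing lines from several requests would blur the result.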

2️⃣ Custom ETag

For this case, we need to use the Sonatype Nexus registry.

The setup is much more involved here, as we need a cloud server to deploy the Nexus OSS image. Anyway, here are the instructions. We are going to use AWS Lightsail to quickly set up a Nexus server.

  1. Go to the AWS console > Lightsail
  2. Create an instance using Ubuntu. Make sure that you select the 4GB Memory instance (anything smaller will not do).
  3. In the networking tab, allow TCP 8081
  4. Connect to the web shell:
    • sudo apt install docker-compose -y
    • sudo docker pull sonatype/nexus3
    • sudo docker run -d -p 8081:8081 --name nexus sonatype/nexus3. Note the container id.
    • sudo docker logs --follow <container id>. Wait until you see Started Sonatype Nexus OSS.
    • sudo docker exec -it <container id> cat /nexus-data/admin.password. Note the password.
  5. Connect to <instance ip>:8081, use admin and the password (previous step) to sign in.
  6. Complete the first-time setup. You will set a new password.
  7. Download the pom file from https://github.com/10io/fruits/packages/1967202.
  8. Upload it to Nexus using the Upload tab.

The Nexus instance is now ready. Let's set up the GitLab side.

  1. Set up the dependency proxy on the project to target that repository:
    Project.find(<local_project_id>).create_dependency_proxy_packages_setting!(maven_external_registry_url: 'http://<instance ip>:8081/repository/maven-releases', maven_external_registry_username: 'admin', maven_external_registry_password: '<nexus admin password>', enabled: true)
  2. Disable background workers (so that packages can't be destroyed):
    $ gdk stop rails-background-jobs
  3. Check that the project has no packages:
    Project.find(<project_id>).packages
    => []
  4. Tail the workhorse logs: $ gdk tail workhorse.
  5. Let's pull the pom.xml file of one of the packages:
    $ curl "http://<gitlab user>:<gitlab pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom"
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
      ...
  6. Check the packages:
    Project.find(<project_id>).packages
    => [#<Packages::Package:0x00000001699b7b10 id: 330
    
    Project.find(232).packages.first.package_files
    => [#<Packages::PackageFile:0x0000000169f073b8
  7. Check the workhorse logs:
    2023-11-02_16:53:15.81263 gitlab-workhorse      : {"client_mode":"local_tempfile","copied_bytes":1237,"correlation_id":"01HE8F80VYE3SMB8QK4PNXF89P","filename":"upload","is_local":true,"is_multipart":false,"is_remote":false,"level":"info","local_temp_path":"/Users/david/projects/gitlab-development-kit/gitlab/shared/packages/tmp/uploads","msg":"saved file","remote_id":"","time":"2023-11-02T17:53:15+01:00"}
    2023-11-02_16:53:16.15596 gitlab-workhorse      : {"content_type":"application/xml","correlation_id":"01HE8F80VYE3SMB8QK4PNXF89P","duration_ms":3453,"host":"gdk.test:8000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"172.16.123.1:52845","remote_ip":"172.16.123.1","route":"^/api/","status":200,"system":"http","time":"2023-11-02T17:53:16+01:00","ttfb_ms":3109,"uri":"/api/v4/projects/232/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom","user_agent":"curl/8.1.2","written_bytes":1237}
    (the file was pulled from the remote registry and published into the package registry)
  8. Let's pull the file again:
    $ curl "http://<gitlab user>:<gitlab pat>@gdk.test:8000/api/v4/projects/<project_id>/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom"
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
      ...
  9. We can see that the package was not destroyed and we didn't publish a new one:
    Project.find(232).packages.count
    => 1
  10. Check the workhorse logs:
    2023-11-02_16:54:08.23551 gitlab-workhorse      : {"correlation_id":"01HE8F9N3Z752Z3YDJ0MP8XPEA","file":"/Users/david/projects/gitlab-development-kit/gitlab/shared/packages/83/5d/835d5e8314340ab852a2f979ab4cd53e994dbe38366afb6eed84fe4957b980c8/packages/332/files/552/ananas-1.3.5.pom","level":"info","method":"GET","msg":"Send file","time":"2023-11-02T17:54:08+01:00","uri":"/api/v4/projects/232/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom"}
    2023-11-02_16:54:08.23567 gitlab-workhorse      : {"content_type":"application/octet-stream","correlation_id":"01HE8F9N3Z752Z3YDJ0MP8XPEA","duration_ms":2027,"host":"gdk.test:8000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"172.16.123.1:52851","remote_ip":"172.16.123.1","route":"^/api/","status":200,"system":"http","time":"2023-11-02T17:54:08+01:00","ttfb_ms":2027,"uri":"/api/v4/projects/232/dependency_proxy/packages/maven/com/fruits/ananas/1.3.5/ananas-1.3.5.pom","user_agent":"curl/8.1.2","written_bytes":1237}
    (The local file was used and sent. See that we used the 'Send file' logic.)

🔮 Conclusions

It might not look like much, but in both cases the dependency proxy behaves the same way: the cache entry is created on the first request and re-used on subsequent requests. This is exactly the aim of this MR 🎉

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by David Fernandez
