Refactor how NuGet package files are updated
## 🦌 The Context
When packages are sent to the GitLab Packages Registry, some package types arrive without enough information. The Packages Registry needs at least the package name and package version to put the package in a state such that it can be pulled from the Registry.
In other words, for some package types (currently NuGet and rubygems), the Packages Registry receives a new package with a fixed package name (the same one on all uploads) and no package version.
To support these package types, we enqueue a background job. That job is responsible for pulling the upload from object storage, opening it, looking for a metadata file (`*.nuspec` for NuGet packages, a gemspec for rubygems), reading that metadata to extract (at least) the package name and version, and updating the proper row in the database accordingly.
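As an illustration of the metadata extraction step (a simplified sketch, not the actual GitLab parser; the sample `.nuspec` omits the XML namespace for brevity):

```ruby
require "rexml/document"

# Minimal .nuspec payload, as found inside a NuGet package (.nupkg is a zip archive).
NUSPEC = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <package>
    <metadata>
      <id>MyCompany.MyPackage</id>
      <version>1.2.3</version>
    </metadata>
  </package>
XML

# Hypothetical helper: pull the two fields the background job needs.
def extract_name_and_version(nuspec_xml)
  doc = REXML::Document.new(nuspec_xml)
  metadata = doc.root.elements["metadata"]
  [metadata.elements["id"].text, metadata.elements["version"].text]
end

name, version = extract_name_and_version(NUSPEC)
# => ["MyCompany.MyPackage", "1.2.3"]
```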
Those background jobs have two main paths:
1. A package with the same name + version exists. In this case, the package file needs to be moved to the existing package and the dummy package created during the upload is destroyed.
2. A package with the same name + version doesn't exist. This is the easy path: we re-use the dummy package created during the upload and simply update it with the proper package name + version.
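The two paths above can be sketched as follows (`Package` and the lookup logic are simplified stand-ins, not the actual GitLab models):

```ruby
# Simplified stand-in for the real model: a package with a name, a version,
# and a list of attached package files.
Package = Struct.new(:name, :version, :files)

def process_dummy_package(packages, dummy, name:, version:)
  existing = packages.find { |p| !p.equal?(dummy) && p.name == name && p.version == version }
  if existing
    # Path 1: a package with that name + version exists.
    # Move the file over to it and destroy the dummy package.
    existing.files.concat(dummy.files)
    packages.delete(dummy)
    existing
  else
    # Path 2: no match. Re-use the dummy package and set the real name + version.
    dummy.name = name
    dummy.version = version
    dummy
  end
end
```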
Both paths lead to similar changes on the package file:
- The `package_id` can be updated (for path 1. above).
- The `file_name` is updated.
Those updates have a pretty impactful side effect: they change the path of the package file. This path is actually the key used to store the file in object storage. Put simply, if those changes happen on the package file, the file in object storage needs to be "moved". This is currently taken care of by `carrierwave` directly: we simply pass the object storage file again when updating the package file.
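To see why the file has to move, here is a hypothetical key scheme (the real key is built by the carrierwave uploader, but it depends on the same two attributes):

```ruby
# Hypothetical key scheme for illustration only: the storage key is derived
# from the package id and the file name, so changing either relocates the object.
def object_storage_key(package_id:, file_name:)
  "packages/#{package_id}/files/#{file_name}"
end

before = object_storage_key(package_id: 1, file_name: "NuGet.Temporary.Package.nupkg")
after  = object_storage_key(package_id: 42, file_name: "mypackage.1.2.3.nupkg")
before == after # => false: the stored object must be moved to the new key
```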
## 🔥 The ~bug
We're going to focus on NuGet uploads, but the ~bug described here could happen to any package type that needs a background job that moves the file in object storage.
The above works well until the uploaded package file is quite big (500MB+). In that case, the background job simply fails to update the package file. As a result, the package file is not available for pulling. That's issue #321361 (closed).
During the analysis, we discovered a few things:
- the background job doesn't actually fail. It is killed because it uses too much memory.
- using a memory profiler on the staging rails console revealed that several networking libraries were in the top 3 of memory allocations
- we confirmed the above suspicion by looking at the GCP requests made by `fog` to object storage. We found that:
  - The actual file is fully downloaded twice during the package file update.
  - Once downloaded, it is uploaded to its new location.
- in short, by passing the file again to `carrierwave`, it is considered a new file. As such, `carrierwave` downloads it and then uploads it.
We dug into the double "download" operation. It seems to come from the `fog-google` adapter. We opened an issue on the official tracker for it.
## 🚒 The Solution
For our problem at hand, we're going to:
- not use `carrierwave` to handle the update of the file in object storage. Instead, move the file ourselves upon a successful update.
  - a "move" operation doesn't really exist in the object storage APIs. As such, we will simply "copy" the file to its new location/key and then "delete" the old key.
- use a feature flag. The NuGet packages registry is used quite extensively and the above changes represent a non-trivial refactoring, so a feature flag lets us easily switch the background job implementation between the old and new versions. Using a feature flag will also allow us to:
  - extensively test this change in staging, in particular with uploads of heavy files.
  - incrementally roll out the change on gitlab.com
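The "move" emulation can be pictured with a toy in-memory client standing in for the real `fog` connection (all class and method names below are illustrative; the point is only the request pattern):

```ruby
# Toy in-memory object storage client. Only the request pattern matters:
# a "move" becomes a copy request followed by a delete request.
class FakeObjectStorage
  def initialize
    @objects = {}
  end

  def put(key, body)
    @objects[key] = body
  end

  def copy(src, dst)
    @objects[dst] = @objects.fetch(src)
  end

  def delete(key)
    @objects.delete(key)
  end

  def keys
    @objects.keys
  end
end

def move(storage, from:, to:)
  storage.copy(from, to)  # first request: server-side copy, no full download
  storage.delete(from)    # second request: drop the old key
end

storage = FakeObjectStorage.new
storage.put("packages/1/files/old.nupkg", "bytes")
move(storage, from: "packages/1/files/old.nupkg", to: "packages/42/files/new.nupkg")
storage.keys # => ["packages/42/files/new.nupkg"]
```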
## 🔍 What does this MR do?
- Create a new service for updating package files (`::Packages::UpdatePackageFileService`)
  - This service is the core of the changes. It sets up the package file model so that a "move" in object storage is triggered.
- Update the package file model (`::Packages::PackageFile`)
  - Adds a virtual attribute that allows us to indicate that a "move" in object storage is necessary.
  - When enabled, that virtual attribute disables the "standard" `carrierwave` callback that takes care of updating the file in object storage.
    - It is this callback that triggers the many downloads and an upload.
  - When enabled, that virtual attribute sets a callback that "moves" the file in object storage. This callback is executed after a successful transaction. This way, we're sure that we're clear to move the file. See !66728 (comment 637798354)
- Update the current service to use either:
  - The existing implementation if the feature flag is disabled.
  - The new service for updating package files (`::Packages::UpdatePackageFileService`) if the feature flag is enabled.
- Update/add the relevant specs
- feature flag rollout issue: #336511 (closed)
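The model change described above can be pictured with this plain-Ruby sketch (the attribute and event names are hypothetical; the real model uses ActiveRecord callbacks and `after_commit`):

```ruby
# Plain-Ruby stand-in for the package file model. The virtual attribute picks
# which of the two code paths runs on save.
class PackageFileSketch
  attr_accessor :file_store_callback_disabled # the virtual attribute (name is hypothetical)
  attr_reader :events

  def initialize
    @file_store_callback_disabled = false
    @events = []
  end

  # Mimics save! + after_commit: the move only runs once the update succeeded.
  def save!
    if file_store_callback_disabled
      @events << :move_in_object_storage_after_commit # copy + delete
    else
      @events << :carrierwave_store # legacy path: download(s) + upload
    end
  end
end

file = PackageFileSketch.new
file.file_store_callback_disabled = true
file.save!
file.events # => [:move_in_object_storage_after_commit]
```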
## 🖥 Screenshots or Screencasts (strongly suggested)
We have a few cases to test manually. This is mainly because we don't have integration or QA tests with different object storage configurations (yet).
Also, in the past, we got hit by behavior differences between MinIO and AWS S3 that led to undetected ~bug s. To avoid this, we're going to manually test several object storage conditions with and without the feature flag.
### Methodology
- We created a new empty project.
- We uploaded a new NuGet package using https://gitlab.com/10io/gl_pru, which in turn simply uses the `$ nuget` client.
  - This is the case `NuGet package doesn't exist`. In this case, the dummy package created during the upload is re-used.
- We uploaded the same package name + version to the same project.
  - This is the case `NuGet package already exists`. In this case, the existing package is used to "append" the new package file. The dummy package created during the upload is discarded.
- Once we gathered all the logs, we removed the project using the UI.
  - If object storage is used, we check that the bucket is properly emptied.
### Results
#### feature flag disabled

| | object storage disabled | GCP (gitlab.com / self-managed) | AWS S3 (self-managed) | AWS S3 SSE (self-managed) | Azure Blob Storage |
|---|---|---|---|---|---|
| NuGet package exists | ✅ | ✅ | ✅ | ✅ | ✅ |
| NuGet package doesn't exist | ✅ | ✅ | ✅ | ✅ | ✅ |
Notes:
- The ✅ means that the background job was able to process the package file and I was able to download the package file through the UI.
- GCP logging has been done by looking at the server logs: `google-api-client` already logs HTTP requests.
- AWS logging has been done by configuring `fog` to use a proxy.
- As expected, when the feature flag is disabled, we get the current implementation's behavior and all uploads are processed properly.
- We can observe that in both configurations (GCP and AWS), the package file is downloaded and uploaded at least once.
  - This is the network interaction we want to remove with this MR.
- Special note for AWS when the package exists: in this case, we have 2 downloads and 2 uploads 😱
- The `object storage disabled` configuration is there to verify that the background job still works properly for users that don't have any object storage set.
#### feature flag enabled

| | object storage disabled | GCP (gitlab.com / self-managed) | AWS S3 (self-managed) | AWS S3 SSE (self-managed) | Azure Blob Storage |
|---|---|---|---|---|---|
| NuGet package exists | ✅ | ✅ | ✅ | ✅ | ✅ |
| NuGet package doesn't exist | ✅ | ✅ | ✅ | ✅ | ✅ |
Notes:
- We properly see copy + delete requests in both cases (GCP and AWS). 🚀
- We have several metadata requests, but that's ok: usually those are GET or HEAD requests to get metadata information about the file (such as its size). Those are not "full" downloads where the entire package file is transferred.
- During my manual checks with a heavy package file, the background job execution is shortened. It's a side effect I didn't expect: by replacing the download and upload requests with copy + delete ones, the background job is faster.
## 📐 Does this MR meet the acceptance criteria?
### Conformity
- [x] I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
- [-] I have added/updated documentation, or it's not needed. (Is documentation required?)
- [x] I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
- [-] I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
- [x] I have self-reviewed this MR per code review guidelines.
- [x] This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
- [x] I have followed the style guides.
- [x] This change is backwards compatible across updates, or this does not apply.
### Availability and Testing
- [x] I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
- [-] I have tested this MR in all supported browsers, or it's not needed.
- [-] I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
### Security
Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.
- [-] Label as security and @ mention `@gitlab-com/gl-security/appsec`
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team