Refactor how NuGet package files are updated
## 🦌 The Context
When packages are sent to the GitLab Packages Registry, some package types arrive without enough information. The Packages Registry needs at least the package name and package version to put the package in a state such that it can be pulled from the Registry.
In other words, for some package types (currently NuGet and rubygems), the Packages Registry receives a new package with a fixed package name (the same one on all uploads) and no package version.
To support these package types, we enqueue a background job. That job is responsible for pulling the upload from object storage, opening it, looking for a metadata file (`*.nuspec` for NuGet packages, a gemspec for rubygems), reading that metadata to extract (at least) the package name and version, and updating the proper row in the database accordingly.
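As an illustration of the metadata extraction step (a simplified sketch, not the actual GitLab parser; the sample `.nuspec` omits the XML namespace for brevity):

```ruby
require "rexml/document"

# Minimal .nuspec payload, as found inside a NuGet package (.nupkg is a zip archive).
NUSPEC = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <package>
    <metadata>
      <id>MyCompany.MyPackage</id>
      <version>1.2.3</version>
    </metadata>
  </package>
XML

# Hypothetical helper: pull the two fields the background job needs.
def extract_name_and_version(nuspec_xml)
  doc = REXML::Document.new(nuspec_xml)
  metadata = doc.root.elements["metadata"]
  [metadata.elements["id"].text, metadata.elements["version"].text]
end

name, version = extract_name_and_version(NUSPEC)
# => ["MyCompany.MyPackage", "1.2.3"]
```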
Those background jobs have two main paths:
1. A package with the same name + version exists. In this case, the package file needs to be moved to the existing package and the dummy package created during the upload is destroyed.
2. A package with the same name + version doesn't exist. This is the easy path: we re-use the dummy package created during the upload and simply update it with the proper package name + version.
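The two paths above can be sketched as follows (`Package` and the lookup logic are simplified stand-ins, not the actual GitLab models):

```ruby
# Simplified stand-in for the real model: a package with a name, a version,
# and a list of attached package files.
Package = Struct.new(:name, :version, :files)

def process_dummy_package(packages, dummy, name:, version:)
  existing = packages.find { |p| !p.equal?(dummy) && p.name == name && p.version == version }
  if existing
    # Path 1: a package with that name + version exists.
    # Move the file over to it and destroy the dummy package.
    existing.files.concat(dummy.files)
    packages.delete(dummy)
    existing
  else
    # Path 2: no match. Re-use the dummy package and set the real name + version.
    dummy.name = name
    dummy.version = version
    dummy
  end
end
```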
Both paths lead to similar changes on the package file:
- The `package_id` can be updated (for path 1. above).
- The `file_name` is updated.
Those updates have a pretty impactful side effect: they change the path of the package file. This path is actually the key used to store the file in object storage. Put simply, if those changes happen on the package file, the file in object storage needs to be "moved". This is currently taken care of by `carrierwave` directly: we simply pass the object storage file again when updating the package file.
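To see why the file has to move, here is a hypothetical key scheme (the real key is built by the carrierwave uploader, but it depends on the same two attributes):

```ruby
# Hypothetical key scheme for illustration only: the storage key is derived
# from the package id and the file name, so changing either relocates the object.
def object_storage_key(package_id:, file_name:)
  "packages/#{package_id}/files/#{file_name}"
end

before = object_storage_key(package_id: 1, file_name: "NuGet.Temporary.Package.nupkg")
after  = object_storage_key(package_id: 42, file_name: "mypackage.1.2.3.nupkg")
before == after # => false: the stored object must be moved to the new key
```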
## 🔥 The ~bug
We're going to focus on NuGet uploads, but the ~bug described here could happen to any package type that needs a background job that moves the file in object storage.
The above works well until the uploaded package file is quite big (500MB+). In that case, the background job simply fails to update the package file. As a result, the package file is not available for pulling. That's issue #321361 (closed).
During the analysis, we discovered a few things:
- the background job doesn't actually fail. It is killed because it uses too much memory.
- using a memory profiler on the staging rails console revealed that several networking libraries were in the top 3 of memory allocations
- we confirmed the above suspicion by looking at the GCP requests made by `fog` to object storage. We found that:
  - The actual file is fully downloaded twice during the package file update.
  - Once downloaded, it is uploaded to its new location.
- in short, by passing the file again to `carrierwave`, it is considered a new file. As such, `carrierwave` downloads it and then uploads it.
We dug into the double "download" operation. It seems to come from the `fog-google` adapter. We opened an issue on the official tracker for it.
## 🚒 The Solution
For our problem at hand, we're going to:
- not use `carrierwave` to handle the update of the file in object storage. Instead, move the file ourselves upon a successful update.
  - a "move" operation doesn't really exist in the object storage APIs. As such, we will simply "copy" the file to its new location/key and then "delete" the old key.
- use a feature flag. The NuGet packages registry is used quite extensively and the above changes represent a non-trivial refactoring, so a feature flag lets us easily switch the background job implementation between the old and new versions. Using a feature flag will also allow us to:
  - extensively test this change in staging, in particular with uploads of heavy files.
  - incrementally roll out the change on gitlab.com
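The "move" emulation can be pictured with a toy in-memory client standing in for the real `fog` connection (all class and method names below are illustrative; the point is only the request pattern):

```ruby
# Toy in-memory object storage client. Only the request pattern matters:
# a "move" becomes a copy request followed by a delete request.
class FakeObjectStorage
  def initialize
    @objects = {}
  end

  def put(key, body)
    @objects[key] = body
  end

  def copy(src, dst)
    @objects[dst] = @objects.fetch(src)
  end

  def delete(key)
    @objects.delete(key)
  end

  def keys
    @objects.keys
  end
end

def move(storage, from:, to:)
  storage.copy(from, to)  # first request: server-side copy, no full download
  storage.delete(from)    # second request: drop the old key
end

storage = FakeObjectStorage.new
storage.put("packages/1/files/old.nupkg", "bytes")
move(storage, from: "packages/1/files/old.nupkg", to: "packages/42/files/new.nupkg")
storage.keys # => ["packages/42/files/new.nupkg"]
```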
## 🔍 What does this MR do?
- Create a new service for updating package files (`::Packages::UpdatePackageFileService`)
  - This service is the core of the changes. It sets up the package file model so that a "move" in object storage is triggered.
- Update the package file model (`::Packages::PackageFile`)
  - Adds a virtual attribute that allows us to indicate that a "move" in object storage is necessary.
  - When enabled, that virtual attribute disables the "standard" `carrierwave` callback that takes care of updating the file in object storage.
    - It is this callback that triggers the many downloads and an upload.
  - When enabled, that virtual attribute sets a callback that "moves" the file in object storage. This callback is executed after a successful transaction. This way, we're sure that we're clear to move the file. See !66728 (comment 637798354)
- Update the current service to use either:
  - The existing implementation if the feature flag is disabled.
  - The new service for updating package files (`::Packages::UpdatePackageFileService`) if the feature flag is enabled.
- Update/add the relevant specs
- feature flag rollout issue: #336511 (closed)
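The model change described above can be pictured with this plain-Ruby sketch (the attribute and event names are hypothetical; the real model uses ActiveRecord callbacks and `after_commit`):

```ruby
# Plain-Ruby stand-in for the package file model. The virtual attribute picks
# which of the two code paths runs on save.
class PackageFileSketch
  attr_accessor :file_store_callback_disabled # the virtual attribute (name is hypothetical)
  attr_reader :events

  def initialize
    @file_store_callback_disabled = false
    @events = []
  end

  # Mimics save! + after_commit: the move only runs once the update succeeded.
  def save!
    if file_store_callback_disabled
      @events << :move_in_object_storage_after_commit # copy + delete
    else
      @events << :carrierwave_store # legacy path: download(s) + upload
    end
  end
end

file = PackageFileSketch.new
file.file_store_callback_disabled = true
file.save!
file.events # => [:move_in_object_storage_after_commit]
```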
## 🖥 Screenshots or Screencasts (strongly suggested)
We have a few cases to test manually. This is mainly because we don't have integration or QA tests with different object storage configurations (yet).
Also, in the past, we got hit by behavior differences between MinIO and AWS S3 that led to undetected ~bug s. To avoid this, we're going to manually test several object storage conditions with and without the feature flag.
### Methodology
- We created a new empty project.
- We uploaded a new NuGet package using https://gitlab.com/10io/gl_pru, which in turn simply uses the `$ nuget` client.
  - This is the case `NuGet package doesn't exist`. In this case, the dummy package created during the upload is re-used.
- We uploaded the same package name + version to the same project.
  - This is the case `NuGet package already exists`. In this case, the existing package is used to "append" the new package file. The dummy package created during the upload is discarded.
- Once we gathered all the logs, we removed the project using the UI.
  - If object storage is used, we check that the bucket is properly emptied.
### Results
#### feature flag disabled

| | object storage disabled | GCP (gitlab.com / self-managed) | AWS S3 (self-managed) | AWS S3 SSE (self-managed) | Azure Blob Storage |
|---|---|---|---|---|---|
| NuGet package exists | ✅ | ✅ | ✅ | ✅ | ✅ |
| NuGet package doesn't exist | ✅ | ✅ | ✅ | ✅ | ✅ |
Notes:
- The ✅ means that the background job was able to process the package file and I was able to download the package file through the UI.
- GCP logging has been done by looking at the server logs: `google-api-client` already logs HTTP requests.
- AWS logging has been done by configuring `fog` to use a proxy.
- As expected, when the feature flag is disabled, we get the current implementation's behavior and all uploads are processed properly.
- We can observe that in both configurations (GCP and AWS), the package file is downloaded and uploaded at least once.
  - This is the network interaction we want to remove with this MR.
- Special note for AWS when the package exists: in this case, we have 2 downloads and 2 uploads 😱
- The `object storage disabled` configuration is there to verify that the background job still works properly for users that don't have any object storage set.
#### feature flag enabled

| | object storage disabled | GCP (gitlab.com / self-managed) | AWS S3 (self-managed) | AWS S3 SSE (self-managed) | Azure Blob Storage |
|---|---|---|---|---|---|
| NuGet package exists | ✅ | ✅ | ✅ | ✅ | ✅ |
| NuGet package doesn't exist | ✅ | ✅ | ✅ | ✅ | ✅ |
Notes:
- We properly see copy + delete requests in both cases (GCP and AWS). 🚀
- We have several metadata requests, but that's ok: usually those are GET or HEAD requests to get metadata information about the file (such as its size). Those are not "full" downloads where the entire package file is transferred.
- During my manual checks with a heavy package file, the background job execution is shortened. It's a side effect I didn't expect: by replacing the download and upload requests with copy + delete ones, the background job is faster.
## 📐 Does this MR meet the acceptance criteria?
### Conformity
- [x] I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
- [-] I have added/updated documentation, or it's not needed. (Is documentation required?)
- [x] I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
- [-] I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
- [x] I have self-reviewed this MR per code review guidelines.
- [x] This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
- [x] I have followed the style guides.
- [x] This change is backwards compatible across updates, or this does not apply.
### Availability and Testing
- [x] I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
- [-] I have tested this MR in all supported browsers, or it's not needed.
- [-] I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.
### Security
Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.
- [-] Label as security and @ mention `@gitlab-com/gl-security/appsec`
- [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
- [-] Security reports checked/validated by a reviewer from the AppSec team