Stable Package Files Object Storage Keys
🚦 Status
💼 Summary
Package file objects have a key that is used to locate the physical file in Object Storage. That key is currently a function that depends on parent objects (such as the package and the project).
This makes destroying package files a more complex operation than it needs to be.
As we tackle more features that will need to delete packages and package files in bulk, we propose here to use a stable Object Storage key: a key whose value is computed when the package file is created and is stored in the database. That key will be part of the attributes of package files. In other words, it is computed and set once for the whole lifetime of the package file record.
This makes the package file object more "standalone" from its parent objects and unlocks the following improvements:
- Cascading deletes from a package or a project.
- Use the standard Rails way to delete an object.
- Faster deletes.
  - This is an interesting aspect for the upcoming feature: Package Registry cleanup policies MVC (&5152 - closed).
- Some operations done by background jobs in the Package Registry will not need to move the physical file within Object Storage. Example: how NuGet packages are processed by background jobs.
- Fewer interactions with Object Storage, which could lead to operational savings.
This change must not impact any user-facing feature.
🍩 Context
At the very core of the Package Registry, we work with this (simplified) set of models:
```mermaid
flowchart LR
  Group -- 1:n --> Project
  Project -- 1:n --> Package
  Package -- 1:n --> PackageFile
  PackageFile -- 1:1 --> os((Object Storage file))
```
In summary, we have a cascading "has many" relationship from `Group` to `PackageFile`.
The `PackageFile` holds a "key" that is the actual location of the physical file (eg. the data) in Object Storage.
We're using the term Object Storage in a quite broad sense. We could simplify it as a dictionary between keys and actual data. We can ask Object Storage: "hey, give me the data associated with key `foobar`" through an API.
Also, Object Storage is usually backed by providers such as `S3`, `Google Cloud Storage` or even `filesystem`. The last one simply uses the OS file system as a space to store the data.
The package file key
The package file key is currently a function.
We can see that the key is a string composed of:
- Package file `id`
- Package `id`
- Project `id`
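To make this concrete, here is a minimal sketch of that behavior. The class name and path layout are illustrative, not the exact GitLab implementation; the point is that the key is recomputed from parent objects every time it is read.

```ruby
# Illustrative sketch: the Object Storage key as a function of parent objects.
class PackageFileUploader < CarrierWave::Uploader::Base
  def store_dir
    # Requires the package *and* the project to be loaded. If package_id or
    # project_id changes, the computed key changes with it.
    "packages/#{model.package.project_id}/packages/#{model.package_id}/files/#{model.id}"
  end
end
```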
😠 Current problems
🚮 Deleting a package file
Deleting a package file seems trivial, right?
1. We remove the related row from the database.
2. We ask Object Storage to delete the related physical file.
Those steps are indeed simple or, at least, not complex.
Because we need to delete the related physical file, (2.) must be handled on the Rails side (and not on the database side).
This is more or less handled automatically by Rails and CarrierWave (the library that handles the communication with Object Storage).
In other words, Rails must instantiate the package file record and destroy it so that callbacks are executed. One of those callbacks sends the DELETE operation to Object Storage.
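As a simplified sketch (the real model has more moving parts), mounting a CarrierWave uploader is what wires that callback in:

```ruby
# Simplified sketch: mounting an uploader registers a callback that removes
# the remote file when the record is destroyed.
class PackageFile < ApplicationRecord
  mount_uploader :file, PackageFileUploader
end

package_file.destroy! # deletes the row and issues a DELETE to Object Storage
```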
This is fine but there is a flaw: scalability.
Let's say that I have a package with 10 files and I destroy the package. Rails must instantiate 10 package files and destroy them one by one. Good.
Now, let's turn up the scale: I have a single package with 150K package files. Now, when destroying the package, Rails must instantiate each package file and destroy it. This takes time. It no longer takes a trivial amount of time.
The above example is only at the package level; what about the project level (eg. a project with thousands of packages)? What about destroying a whole group full of projects with packages?
Instantiating that many Active Record instances only to destroy them and run a callback is not great. Proof.
Because of this scalability issue, we moved to a system where operations that destroy a package file will now "mark" the package file as `pending_destruction` (by updating the `status` column). Then, a background job (cleaner) loops on the backlog of `pending_destruction` objects and destroys them one by one (letting Rails instantiate and destroy them so that the Object Storage file is destroyed too).
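In rough terms (the real worker is more elaborate, and this assumes `status` is an enum that defines a `pending_destruction` scope), the cleaner does something like:

```ruby
# Rough sketch of the cleaner job: each record must be instantiated so the
# file-removal callbacks run, one DELETE request to Object Storage per file.
Packages::PackageFile.pending_destruction.find_each do |package_file|
  package_file.destroy!
end
```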
Things get worse with the key definition. Recall that it is a function and that it needs a project. This means that when Rails instantiates the package file object, the `package` and `project` parent objects need to be present. Otherwise, the computed `key` will be the wrong one. This creates a constraint that we need to work with.
🚚 Updating a package file
For some package formats, when a package artifact is uploaded, we can't know what the package name and version are. We have a file (usually a `.zip` or `.tar.gz`) and that's it.
In order to make this package available within the Package Registry, we queue a background job that will take the package artifact, open it and extract metadata information (among other things, it will get the package name and version).
When that background job finishes the extraction, it has a package name and version. There are two situations that can happen:
1. A package with that name and version doesn't exist within the project.
2. A package with that name and version exists within the project.
With (1.), things are simple: we update the dummy package with the package name and version.
With (2.), things are more challenging. The job needs to "move" the package file from the dummy package to the existing package. This move means that the `package_id` of the package file will change. Guess what happens with the Object Storage key? Yes, the function will return a different result now. In order to maintain consistency, this package file move needs to be mirrored in Object Storage: we need to "move" the actual file within Object Storage too.
This operation is not trivial and triggers additional Object Storage interactions (usually a copy followed by a delete request). This can lead to bugs with large files.
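A sketch of that move, with all helper names hypothetical, to show where the fragility comes from:

```ruby
# Illustrative sketch: because the key is a function of package_id, updating
# the row alone is not enough, the physical file has to follow.
def move_package_file!(package_file, existing_package)
  old_key = key_for(package_file)                    # derived from the dummy package
  package_file.update!(package_id: existing_package.id)
  new_key = key_for(package_file)                    # same function, different result
  object_storage.copy(old_key, new_key)              # copy request
  object_storage.delete(old_key)                     # delete request; can fail half-way on large files
end
```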
Lastly, I'd like to point out that this kind of background job (what we call package metadata extraction jobs) will be used more and more. It started with the NuGet Repository. Then, the RubyGems Repository had the same conditions. More recently, the Debian Repository also needed such a background job. As we expand the GitLab Package Registry to support more formats, we can expect more background jobs that need to move package files across packages.
💪 Suggested solution
The suggested solution here is to switch the key to a database column.
The value of that column is set once and for all when the package file is inserted into the database. It's similar to executing the existing function once and caching the result in the database.
We could even go further and use totally random keys, so that we need neither a `package_id` nor a `project_id` to generate one.
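As a minimal sketch (the column name, class names and key layout are hypothetical), this could look like:

```ruby
# Hypothetical migration: persist the key on the package files table.
class AddObjectStorageKeyToPackagesPackageFiles < ActiveRecord::Migration[6.1]
  def change
    add_column :packages_package_files, :object_storage_key, :text
  end
end

# Hypothetical model hook: compute the key once, at creation time only.
class Packages::PackageFile < ApplicationRecord
  before_create :set_object_storage_key

  private

  def set_object_storage_key
    # A fully random key: neither package_id nor project_id is needed, and
    # the value never changes for the lifetime of the record.
    self.object_storage_key ||= "packages/#{SecureRandom.uuid}"
  end
end
```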
👍 Upsides
- The main benefit here is stability. The key will not change during the whole lifetime of the package file.
- This change goes in line with some discussions we're having in the Object Storage Working Group.
👎 Downsides
- We could lose the ability to remove a set of keys that match a `prefix`. For example, when destroying a package, we could create a single request to Object Storage that says "destroy all the keys that have this project_id and this package_id".
  - We don't currently use this possibility.
  - This possibility helps to avoid dangling files in Object Storage: files that still exist (and use storage) in Object Storage but are not referenced by Rails anymore.
    - Cleaning up these keys can be a really complex task.
🔓 Possible improvements unblocked
♻ Cascading deletes
Because the entire Object Storage key is stored along with the package file, we no longer need parent objects (such as the package or the project) when we delete a package file.
In a way, the package file becomes a "standalone", self-sufficient object for its own destruction.
This means that we can now use a cascading delete for the package object and a cascading nullify for package files.
This means that:
- Deleting a package? That's easy: `package.destroy!`
- Deleting a project with packages? That's easy: `project.destroy!`
- Deleting a group with projects with packages? That's easy: `group.destroy!`
In all three cases, the delete will be cascaded by the database. This cascade will remove the package rows automatically. What's more important is that we cascade the delete operation in the most efficient way possible: the database does it for us.
Interesting fact: cascading deletes have been considered as an alternative to our current `pending_destruction` system. Unfortunately, we hit a blocker due to the constraint of having the parent package and project at hand when deleting a package file.
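A hypothetical migration sketch of the foreign key changes this relies on (standard Rails helpers shown; the exact GitLab migration helpers would differ):

```ruby
# Hypothetical sketch: let the database cascade the deletes.
class UpdatePackagesForeignKeys < ActiveRecord::Migration[6.1]
  def change
    # Deleting a project deletes its package rows directly in the database.
    remove_foreign_key :packages_packages, :projects
    add_foreign_key :packages_packages, :projects, on_delete: :cascade

    # Deleting a package leaves its files behind as "dangling" rows
    # (package_id set to NULL), picked up later by the cleanup job.
    remove_foreign_key :packages_package_files, column: :package_id
    add_foreign_key :packages_package_files, :packages_packages,
                    column: :package_id, on_delete: :nullify
  end
end
```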
👍 Upsides
- This service is not needed anymore.
- Operations that delete a package can simply call `package.destroy!`. Example.
  - `#destroy!` is the standard way to destroy records in Rails. As such, new developers will have an easier time with the parts that destroy packages/package files.
- Destroying a project will automatically take care of the linked packages. We don't need to destroy them in advance.
- Destroying a group will automatically take care of the linked packages. We don't need to destroy them in advance.
👎 Downsides
- With cascading deletes, the database can potentially have more work to do when, for example, `project.destroy!` is executed.
  - Could this be a problem? 🤔
🦋 Dangling package files
This is fine, but what about the package files? We now have package files with a `nil` `package_id`. What do we do with these?
Well, introducing: dangling package files.
Dangling package files are package files that are not linked to a package anymore (`package_id` set to `nil`).
The cleanup worker can be updated to detect those and delete the package files as usual.
This means that we can remove the support for marking a package file as `pending_destruction`. This is not needed anymore.
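For example (the scope name is hypothetical), the cleanup worker could switch from the `status` column to:

```ruby
# Hypothetical scope: dangling package files are simply rows without a parent.
class Packages::PackageFile < ApplicationRecord
  scope :dangling, -> { where(package_id: nil) }
end

# The worker destroys them through the usual path, so the Object Storage
# file is deleted alongside the row.
Packages::PackageFile.dangling.find_each(&:destroy!)
```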
👍 Upsides
- The `pending_destruction` status is not needed anymore.
👎 Downsides
- Data consistency is not enforced by the database anymore (we can have package files linked to no package).
  - Those inconsistent package files are not usable.
  - This is mitigated by the cleanup job that will take care of removing them.
  - Such loose keys already exist with the database decomposition. See the related documentation.
- With dangling package files, the project statistics need to be updated before parent objects are destroyed. Right now, when a package file is destroyed, the statistics are updated automatically using the package -> project parent relationship.
- The cleanup job will probably need a specialized index for those dangling rows.
  - This is an acceptable downside as we already have an index on the `status` column.
🚀 More reliable background jobs
Background jobs that need to process new packages will not need to move the file in Object Storage anymore. In fact, they will not need to interact with Object Storage at all (except to download the file). They will "simply" move references around in the database.
This makes them independent of the file size. No matter how large the package file is, moving a package file from package A to package B will always succeed.
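With a stable key, the move sketched earlier collapses to a single row update:

```ruby
# With a stable key there is no copy + delete: the Object Storage key is
# unchanged, only the database reference moves.
package_file.update!(package_id: existing_package.id)
```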
👍 Upsides
- This service is not needed anymore.
- Background jobs will only need to download the file and that's it. No more copy + delete requests. In other words, they would do 1 Object Storage interaction instead of 3.
  - This seems like a small improvement but, as an example, for NuGet packages we have around 2500 executions per day on gitlab.com. This reduction could lead to some operational savings on Object Storage interactions.
- One could also argue that the fewer network interactions we do during a job execution, the faster it is. We are not sure that we would get an observable difference here.
👎 Downsides
None.
📐 Project statistics
By having a cascading delete, package files are instantly "disconnected" from the project. As such, any project statistics refresh will reflect the accurate number.
Currently, our `pending_destruction` mark suffers from the delay (up to 12 hours) between marking the package file and actually removing it. During this delay, `pending_destruction` package files still count towards usage quotas in project statistics.
👥 User facing feature impact
The absolute main constraint is that this change should be done in a totally transparent way for users, which means that we must have no impact at all on user-facing features. All Package Registry upload and download operations must behave as usual.
🔮 Other considerations
Deleting packages / package files in bulk is becoming a focus for the ~"group::package" team as we're implementing ways to manage the storage used by packages. One example is Cleanup policies for packages. When such a policy is executed, it will select a set of packages / package files and will need to destroy them. With a stable key, deleting a package will be as simple as `package.destroy!` and deleting a package file will simply be `package_file.update!(package_id: nil)`.
Having a stable Object Storage key means that we can freely move the package file rows around. In Package model and structure decomposition analysis (&7789), we're discussing how to split the main `packages_packages` table into multiple ones (per package format). Such a split will inherently update/move the package id. Having the Object Storage key in the database eases those operations.
⚡ Conclusions
In ideal conditions, Object Storage file interactions should be limited to:
- 1 POST (file creation).
- `n` GET (file reads).
- 1 DELETE (file destruction).
With the current architecture for file uploads, we are not there yet. We do more interactions than needed and this adds unneeded complexity when managing package file uploads.
With a stable Object Storage key for package files, we are swimming towards that goal. It also unlocks cascading deletes by the database, which should be:
- easier to use: `package.destroy!` and that's it.
- faster to execute: everything is handled by the database for us. No Rails execution involved.
This is a non-trivial change that doesn't bring direct value to users. I would still consider it, as it helps reduce complexity and costs, which is an indirect value to users.
📊 Estimating work effort
🕰 MRs for introducing the stable key
| MR | labels | weight |
|---|---|---|
| 1. Add a `key` or `path` column to `packages_package_files` | database backend | 1 |
| 2. Set the `key` or `path` value to the result of the existing function when a package file is created | backend | 2 |
| 3. Update the existing service to also update the `key` value | backend | 1 |
| 4. Background migration to set the `key` value of all existing package files that don't have a key set, including `pending_destruction` package files | backend database | 3 |
| 5. Update the function that reads the key to use the `key` column instead of the function | backend feature flag | 1 |
| 6. Once the change is stable, remove the feature flag | backend feature flag | 1 |
As we can see, the most challenging part is the migration for the existing data (MR 4). A background migration is advised here. The good news is that this data migration can take whatever time it needs. The change is only live when MR 5 is slowly deployed with the feature flag.
I think that given the amount of work, it is a bit tight to have everything delivered in a single milestone.
🔧 MRs for the possible improvements
For cascading deletes and dangling package files:
| MR | labels | weight |
|---|---|---|
| 1. Update the cleanup job to take into account dangling package files | database | 1 |
| 2. Update the package and package file foreign keys to set up a cascading delete or cascading nullify | database | 1 |
| 3. Remove this service and replace the call with a standard `#destroy!` | backend | 2 |
| 4. Clean up the `status` column on the `packages_package_files` table if necessary. Update the cleanup job to not consider the `status` column anymore | backend database | 2 |
Background jobs "moving" package files:
| MR | labels | weight |
|---|---|---|
| 1. Remove this service and the related code | backend | 2 |
I don't see huge challenges on the improvements side, but that's quite a large number of MRs. It is as many as for introducing the stable key.
🔢 Revisions
- `r3`: added a link to Package model and structure decomposition analysis (&7789).
- `r2`: added a section for project statistics.
- `r1`: initial version.