
Fix package file cleanup worker with duplicated PyPI files

David Fernandez requested to merge 357373-fix-package-files-cleaner into master

🌵 Context

Deleting package files is a non-trivial operation because each package file record is linked to a physical file on object storage. Thus, we need to let Rails instantiate each package file and #destroy! it.

To handle this, we have a background job that takes care of it. In short, different operations "mark" the package file as pending_destruction, and the background job simply loops over the list of files marked this way and destroys them one by one.

The job is powered by the limited_capacity worker concern, and its steps are quite simple:

  1. Pick the next package file marked as pending_destruction.
  2. Update the package file with status: :processing.
  3. #destroy! it.
  4. If there are more package files marked as pending_destruction, re-enqueue itself.

We can see that this is a loop that works off the backlog until all pending_destruction files are removed; a simplified sketch of that loop follows.
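For illustration, assuming a LimitedCapacity-style worker, the loop looks roughly like this (method and scope names are illustrative, not the exact GitLab code):

# Simplified sketch of the cleanup loop (illustrative names).
class CleanupPackageFileWorker
  def perform_work
    # (1) pick the next package file marked as pending_destruction
    package_file = Packages::PackageFile.pending_destruction.first
    return unless package_file

    # (2) flag it as being processed
    package_file.update!(status: :processing)

    # (3) destroy the record and its physical file on object storage
    package_file.destroy!
  end

  def remaining_work_count
    # (4) a non-zero count makes the limited_capacity framework
    # re-enqueue the worker, which continues the loop
    Packages::PackageFile.pending_destruction.limit(2).count
  end
end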

Looking at the logs, we noticed that the loop would break (i.e. the worker would stop even though there were more files to destroy). The failure message revealed:

(Screenshot: Screenshot_2022-03-31_at_12.08.09 — the worker failure message)

Basically, we have a model validation that rejects the update done in step (2).

The model validation in question applies to PyPI package files. This validation is only partially effective: we don't have any database constraint backing it. As documented, this means that if two database connections insert the same file (for the same package) concurrently, both inserts can succeed. In other words, the Rails validation doesn't prevent the database from holding duplicates.
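For context, the validation is of this general shape (a sketch, not the exact model code):

# Sketch of a Rails-only uniqueness validation with no backing
# UNIQUE index in the database (illustrative):
validates :file_name,
          uniqueness: { scope: :package_id },
          if: -> { package&.pypi? }

A Rails uniqueness validation works by running a SELECT before writing; two concurrent connections can each see no conflicting row and both commit, which is exactly the race the Rails guides warn about.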

Now, guess what happens when the duplicated files are marked as pending_destruction and the job tries to update the status in step (2)? Yes, 💥. That's issue #357373 (closed).

The aim of this MR is to fix this issue in the background job.

You may wonder how we marked those duplicates as pending_destruction in the first place, given that the model validation should make such an update fail. The answer is that marking a set of package files is done with a batched update (update_all), and batched updates skip model validations.
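For example (a sketch; duplicate_file stands for one of the duplicated records):

# update_all issues a single SQL UPDATE and skips model validations
# and callbacks entirely, so even duplicates get marked:
package.package_files.update_all(status: :pending_destruction)

# A per-record update runs validations and fails on a duplicate:
duplicate_file.update(status: :pending_destruction) # => false on master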

What this MR will not do

We will not fix the half-working model validation on PyPI package files here. That needs a data cleanup (removing the older duplicates) followed by the creation of a UNIQUE index in the database.

We created #357481 (closed) for that.
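For illustration only, that follow-up could look roughly like the migration below. This is a hypothetical sketch: the index definition is an assumption, and the real change must first deduplicate the data and account for non-PyPI rows in the same table:

# Hypothetical migration sketch for #357481, not the actual change.
class AddUniqueIndexForPypiPackageFiles < ActiveRecord::Migration[6.1]
  disable_ddl_transaction!

  def change
    add_index :packages_package_files, [:package_id, :file_name],
              unique: true, algorithm: :concurrently
  end
end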

🤔 What does this MR do and why?

  • Do not run the PyPI package file uniqueness validation on a pending_destruction package file.
  • For good measure, modify how the cleaner job updates the status in step (2) to skip validations and callbacks (see the sketch after this list).
    • We already have other locations that update the status without going through Active Record validations.
  • Update the related specs.
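Concretely, the two changes are along these lines. This is an illustrative sketch, not the exact diff; the condition shape is an assumption:

# 1. In the model, skip the uniqueness validation for files awaiting
#    destruction (sketch; the real condition may be written differently):
validates :file_name,
          uniqueness: { scope: :package_id },
          if: -> { package&.pypi? && !pending_destruction? }

# 2. In the cleanup worker, write the status directly; update_column
#    bypasses validations and callbacks:
package_file.update_column(:status, :processing)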

🖼 Screenshots or screen recordings

n/a

📐 How to set up and validate locally

In a rails console:

  1. Create a PyPI package with one file:
    # The pypi factories rely on fixture_file_upload, which is not
    # defined in a console session, so define a minimal stand-in first:
    def fixture_file_upload(*args, **kwargs)
      Rack::Test::UploadedFile.new(*args, **kwargs)
    end
    
    pkg = FactoryBot.create(:pypi_package, project: Project.first)
  2. Create a duplicated package file:
    pkgf = FactoryBot.create(:package_file, :pypi, package: pkg)
    pkgf.update_column(:file_name, pkg.package_files.first.file_name) # update_column bypasses model validations and callbacks
  3. Verify that we have duplicates:
    pkg.reload.package_files.map(&:file_name)
    => ["pypi-package-1-1.0.1.tar.gz", "pypi-package-1-1.0.1.tar.gz"]
  4. Mark both package files as pending_destruction:
    pkg.package_files.update_all(status: :pending_destruction)

Setup is complete. Time to verify the behavior.

Using master, let's try to run the cleanup worker manually.

Packages::CleanupPackageFileWorker.new.perform
# [...]
# ActiveRecord::RecordInvalid: Validation failed: File name has already been taken

The job 💥 because of the model validation.

Now, let's try with this MR:

Packages::CleanupPackageFileWorker.new.perform
# [...]
# => nil

Smooth

🚥 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

