Create a background job for package files cleanup

Context

Every day, the GitLab Package Registry is used to publish many thousands of packages. Each uploaded package can include several files, and for package manager formats that allow duplicate uploads, a single package can accumulate many files. For example, with Maven you likely publish a version of a package and also publish daily snapshots of that package. The older snapshot files are no longer useful, so you'd like to delete them programmatically.

As part of the epic &5152 (closed), the Package group aims to deliver package-cleanup policies to help you manage storage for the Package Registry. As an MVC, we'll focus on removing old, unused files.

To do so, we will:

  • Create a background job to scalably handle the marking and deletion of files. (That's this issue)
  • Create a GraphQL endpoint to help CRUD the policies. #346153 (closed)
  • Add this functionality to your project's settings. #227233 (closed)

Problem to solve

Why a background job?

  • A simple API endpoint that executes the cleanup will not scale. We would be forced to limit the number of packages and/or package files targeted or deleted per call. This is the situation we have today with the API for container tags bulk delete (see the warning message in the docs).
  • Long term, a background job will be required. No reason to put off the inevitable.

Proposal

Create a background job that will use a single parameter, keep_n, to mark package files for deletion. 👈🏼 Note: That job is the cleanup policy background job. It will be implemented in #346153 (closed), not in this issue.

  1. Add a way to mark package files as destroyed.
  2. Add background job 1 (scheduled cron job) to find marked package files and enqueue background job 2 (limited capacity worker) to do the destruction.
  3. Add background job 2 to destroy package files.
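
The three steps above can be sketched as follows. This is only a minimal in-memory illustration in Python, not the actual implementation: the `Registry` class, the `PENDING_DESTRUCTION` status name, the queue of job names, and the behaviour of a cleanup job re-enqueueing itself while work remains are all assumptions made for the example. Only the capacity of 2 comes from this issue.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DEFAULT = "default"
    # Hypothetical name for the "destroyed" mark described in step 1.
    PENDING_DESTRUCTION = "pending_destruction"


@dataclass
class PackageFile:
    id: int
    status: Status = Status.DEFAULT


class Registry:
    """In-memory stand-in for the package files table (assumption)."""

    def __init__(self, files):
        self.files = {f.id: f for f in files}

    def mark_destroyed(self, file_id):
        # Step 1: a cheap status flip; nothing is removed here.
        self.files[file_id].status = Status.PENDING_DESTRUCTION

    def next_marked(self):
        # Return the marked file with the lowest id, or None if there is none.
        marked = [f for f in self.files.values()
                  if f.status is Status.PENDING_DESTRUCTION]
        return min(marked, key=lambda f: f.id) if marked else None

    def destroy(self, file_id):
        del self.files[file_id]


MAX_CAPACITY = 2  # the hardcoded starting value suggested in this issue


def cron_job(registry, queue):
    """Job 1: if any file is marked, top the queue up to MAX_CAPACITY jobs."""
    if registry.next_marked() is None:
        return 0
    to_enqueue = max(MAX_CAPACITY - len(queue), 0)
    for _ in range(to_enqueue):
        queue.append("cleanup_package_file")
    return to_enqueue


def cleanup_job(registry, queue):
    """Job 2: destroy one marked file, then re-enqueue while work remains."""
    target = registry.next_marked()
    if target is None:
        return None
    registry.destroy(target.id)
    if registry.next_marked() is not None:
        queue.append("cleanup_package_file")
    return target.id


def drain(registry, queue):
    """Run queued cleanup jobs one by one; return the destroyed file ids."""
    destroyed = []
    while queue:
        queue.popleft()
        file_id = cleanup_job(registry, queue)
        if file_id is not None:
            destroyed.append(file_id)
    return destroyed
```

The point of the split is that marking stays cheap and synchronous, while the expensive deletion happens in bounded batches: the cron job never puts more than `MAX_CAPACITY` cleanup jobs in flight, no matter how many files are marked.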

Further details

  • Mark package files as destroyed.
  • Make sure that destroyed package files are never considered by UI, APIs, services. (<- that's the "they are lost forever" step)
  • Create a cron background job that checks whether any destroyed package files exist. If there are any, it enqueues a number of cleanup package file jobs. That number could be an application setting, or we could start with something hardcoded like 2.
  • Implement the cleanup package file background job.
  • (Note @10io) We will need logged statistics so that we can build a Kibana dashboard for these jobs. For example, we need to know how many destroyed package files are waiting to be cleaned up at any time.
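
Two of the points above, hiding destroyed files from every consumer and logging how many files are waiting for cleanup, can be sketched like this. Again, this is a hedged Python illustration under assumptions: the `destroyed` flag, the `visible` helper, the logger name, and the statistic field names are hypothetical and are not taken from the GitLab codebase.

```python
import json
import logging
from dataclasses import dataclass

# Hypothetical logger name; a real dashboard would read structured log lines.
logger = logging.getLogger("package_files_cleanup")


@dataclass(frozen=True)
class PackageFile:
    id: int
    file_name: str
    destroyed: bool = False  # assumption: the mark is a simple flag


def visible(files):
    """Files that the UI, APIs, and services are allowed to consider."""
    return [f for f in files if not f.destroyed]


def cleanup_stats(files):
    """Count the destroyed files that are still waiting to be cleaned up."""
    pending = sum(1 for f in files if f.destroyed)
    return {"pending_destruction_count": pending, "total_count": len(files)}


def log_cleanup_stats(files):
    """Emit the statistics as one JSON log line and return them."""
    stats = cleanup_stats(files)
    logger.info(json.dumps(stats, sort_keys=True))
    return stats
```

If every read path goes through a single filter such as `visible`, a marked file disappears for users immediately, even though the cleanup job may physically remove it much later.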

Estimation

  1. 2 MRs:
    1. The destroyed mark, plus the related updates in APIs/services. Weight 2.
    2. Both background jobs. This is known territory, as it's very similar to what has been implemented for the dependency proxy. Barring any surprises, this is straightforward, but there are quite a few new objects/services to implement. Weight 2.
Edited by Hugo Ortiz