# Create a background job for package files cleanup
## Context
Every day, many thousands of packages are published to the GitLab Package Registry. When a package is uploaded, it may include several files. For package manager formats that allow duplicate uploads, a single package can accumulate many files over time. For example, with Maven you typically publish a release version of a package and also publish daily snapshots of it. Those older snapshot files are not useful, so you'd like to programmatically delete them.
As part of the epic &5152 (closed), the Package group aims to deliver package-cleanup policies to help you manage storage for the Package Registry. As an MVC, we'll focus on removing old, unused files.
To do so, we will:
- Create a background job to scalably handle the marking and deletion of files. (That's this issue)
- Create a GraphQL endpoint to help CRUD the policies. #346153 (closed)
- Add this functionality to your project's settings. #227233 (closed)
## Problem to solve
### Why a background job?
- A simple API endpoint that executes the cleanup synchronously will not scale: we would be forced to limit the number of packages and/or package files targeted and deleted. This is the situation with the API for container tags bulk delete (see the warning message in the docs).
- Long term, a background job will be required. No reason to put off the inevitable.
## Proposal
Create a background job that will use a single parameter `keep_n` to mark package files for deletion.
- Add a way to mark package files as `destroyed`.
- Add background job 1 (a scheduled cron job) to find marked package files and enqueue background job 2 (a limited capacity worker) to do the destruction.
- Add background job 2 to destroy the marked package files.
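The `keep_n` marking step could look roughly like the following pure-Ruby sketch. Note that `mark_old_files`, the hash fields, and the `:pending_destruction` status are illustrative assumptions, not the shipped schema:

```ruby
# Hypothetical sketch of the marking step: for each package, keep the
# newest `keep_n` files and mark the rest for destruction.
def mark_old_files(package_files, keep_n:)
  package_files
    .group_by { |file| file[:package_id] }
    .flat_map do |_package_id, files|
      # Newest first; everything past the first keep_n files is marked.
      files.sort_by { |file| file[:created_at] }.reverse.drop(keep_n)
    end
    .each { |file| file[:status] = :pending_destruction }
end
```

In the real implementation this would be a database update scoped per package, but the per-package grouping and "keep the newest N" ordering would be the same.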
## Further details
- Mark package files as `destroyed`.
- Make sure that `destroyed` package files are never considered by the UI, APIs, or services. (<- that's the "they are lost forever" step)
- Create a cron background job that checks whether there are any `destroyed` package files. If there are, enqueue a number of cleanup package files jobs. That number could be an application setting, or we could start with something hardcoded like `2`.
- Implement the cleanup package file background job.
- ⚠ (Note @10io) We will need logged statistics so that we can build a Kibana dashboard for these. E.g. we need to know how many `destroyed` package files there are to clean up at any time.
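The two-job pipeline above can be sketched in plain Ruby. This is an in-memory stand-in for the real cron worker plus limited-capacity Sidekiq worker, so the control flow is visible; `CleanupScheduler`, `CleanupWorker`, `MAX_CAPACITY`, and the status field are all hypothetical names, not GitLab's actual classes:

```ruby
# Illustrative in-memory model of the pipeline: job 1 fans out up to
# MAX_CAPACITY cleanup jobs, job 2 destroys one batch per run.
class CleanupScheduler
  MAX_CAPACITY = 2 # hardcoded to start; could become an application setting

  def initialize(files)
    @files = files
  end

  # Job 1 (cron): if any file is marked, enqueue up to MAX_CAPACITY
  # cleanup jobs; otherwise enqueue nothing.
  def perform
    marked = @files.count { |f| f[:status] == :pending_destruction }
    log(marked) # statistics feeding the Kibana dashboard
    return [] if marked.zero?

    [MAX_CAPACITY, marked].min.times.map { CleanupWorker.new(@files) }
  end

  def log(marked_count)
    # Placeholder for structured logging of the pending-destruction count.
    @last_logged = { pending_destruction_count: marked_count }
  end
end

class CleanupWorker
  BATCH_SIZE = 100 # illustrative batch size

  def initialize(files)
    @files = files
  end

  # Job 2 (limited capacity): destroy one batch of marked files.
  def perform
    batch = @files.select { |f| f[:status] == :pending_destruction }.first(BATCH_SIZE)
    batch.each { |f| @files.delete(f) }
    batch.size
  end
end
```

Keeping job 2 idempotent and batch-scoped means the scheduler can safely re-enqueue it on every cron tick until no marked files remain.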
## Estimation
- 2 MRs:
  - `destroyed` mark + update in APIs/services. Weight `2`.
  - Both background jobs. This is known territory as it's very similar to what has been implemented for the dependency proxy. Barring any surprise, this is straightforward, but it's quite a bunch of new objects/services to implement. Weight `2`.