Images are not being properly deleted from the Container Registry UI
Problem to solve
When using the Container Registry and navigating to the user interface, there are container image repositories that are marked as "The image repository is scheduled for deletion" for an inordinate amount of time. Because of #18383 (closed), when you try to transfer a group or project that had images in the registry, you have to delete all of the tags prior to making the move.
This bug prevents you from deleting the images and thus transferring groups or projects.
Screenshots
Proposal
Given that destroyed images in the UI are "marked" as destroyed, I suggest using a limited capacity worker for this. This is a similar approach to the one used for cleanup policies.
In short, don't try to delete all at once because it could be too long/could fail. The worker can go like this:
- Image selection
  - Take the repositories marked as destroyed. The oldest one (oldest `updated_at`) goes first.
  - (A database row lock is needed here.)
- Background execution
  - Does the image have any tags?
    - No: destroy it.
    - Yes: take the list of tags and chunk it. Start deleting the first chunk. When done, take the next one. When all tags are deleted, destroy the image.
  - Limit this operation in time, meaning that the number of chunks to be deleted per execution is time-bounded.
    - If the time limit is hit, simply finish the current chunk and end the background execution. Alternatively, we could end the current tag destroy action and end the background execution.
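The chunked, time-limited deletion above can be sketched in plain Ruby. This is a minimal sketch: `delete_tags_in_chunks`, the injected `delete_tag` callable, and the chunk size are hypothetical stand-ins for the real registry calls, not actual GitLab code.

```ruby
# Sketch of chunked, time-limited tag deletion. The `delete_tag` block stands
# in for the real registry delete call; chunk size and deadline are assumed.
def delete_tags_in_chunks(tags, deadline:, chunk_size: 20, &delete_tag)
  tags.each_slice(chunk_size) do |chunk|
    chunk.each { |tag| delete_tag.call(tag) }
    # If the time limit is hit, finish the current chunk and stop; a later
    # execution resumes with the remaining tags.
    return :timed_out if Time.now >= deadline
  end
  :done # all tags gone; the caller can now destroy the image itself
end
```

Returning a status instead of raising lets the caller decide whether to destroy the image now or simply re-enqueue the work.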
What we get is:
- Background workers that will run constantly, looking for images and their related tags to destroy.
- If an error occurs, it's not that important anymore because the delete will be picked up at a later time.
- If there are too many tags, that's ok too because we will delete a bunch and stop the execution. The delete will be resumed at a later time.
- This worker will process both new and old deletes, meaning it will consider all the images currently marked as deleted and not only the new ones.
Further details
Some folks are still using container images that are marked as deleted. By "using", I mean they still push/pull tags to/from them, even though the images are marked as deleted and "greyed out" in the UI. Those deleted-but-in-use container images will get deleted by the worker once it is refactored.
We will address this by reverting all the "deleted" marks on container images, so that the new worker can work out how to proceed moving forward.
⚙ Technical aspects
- Use a limited capacity worker.
  - The max capacity will need to be an application setting.
    - I see around 500 (internal) images waiting to be deleted on gitlab.com.
    - Given the current load, I think `2` should be a good value to start with.
- We will need a simple cron worker that checks if there is any work available (are any images marked as destroyed?). If that's the case, it enqueues as many limited capacity workers as necessary.
- Not mandatory but it might be a good idea to use a feature flag for this. The feature flag will be used to switch between the existing worker and the limited capacity worker.
- I noticed that tags are deleted by digest. This is known to be slower than deleting them by name. Catch: deleting tags by name is only available if the GitLab Container Registry is used.
- Use the delete tags service. It will use the proper way to delete tags depending on which Container Registry is being used.
- General logic of the worker:
  1. Ask for the tag list.
  2. If it is empty, destroy the image and return.
  3. If it is not empty, delete those `n` tags.
  4. Return to step (1).
- Limit the execution of the workers so that heavy repositories can't "lock" them.
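The four steps above can be sketched as a bounded loop. All names here are hypothetical: `fetch_tags`, `delete_tags`, and `destroy_image` stand in for the real service calls, and `max_rounds` is one way to express the execution limit discussed above.

```ruby
# Hypothetical sketch of the worker's main loop; the three callables stand in
# for the real tag-list, tag-delete, and image-destroy service calls.
def process_image(fetch_tags:, delete_tags:, destroy_image:, max_rounds: 10)
  max_rounds.times do
    tags = fetch_tags.call            # 1. ask for the tag list
    if tags.empty?
      destroy_image.call              # 2. empty: destroy the image...
      return :destroyed               #    ...and return
    end
    delete_tags.call(tags)            # 3. not empty: delete those n tags
  end                                 # 4. return to step (1)
  :rescheduled # round budget exhausted; a later execution resumes the work
end
```

Bounding the rounds (rather than trusting the tag count) is what keeps a heavy repository from monopolizing a worker slot.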
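The cron worker's enqueue decision could look like the following sketch. The helper name and its parameters are illustrative only; in practice the limited capacity worker framework handles most of this bookkeeping itself.

```ruby
# Hypothetical helper: given pending images and currently running jobs, decide
# how many limited capacity workers the cron worker should enqueue, never
# exceeding the (application-setting) max capacity.
def jobs_to_enqueue(pending_images:, running_jobs:, max_capacity: 2)
  return 0 if pending_images.zero?          # no work available, enqueue nothing
  free_slots = [max_capacity - running_jobs, 0].max
  [free_slots, pending_images].min
end
```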
I don't see any deep complexity here, but we might hit a lot of work to do. As such, this could be done in 2 or 3 MRs. I'm raising the weight to 3 to reflect this uncertainty.