Review Request - Throttling for Cleanup Policies
Scaling Request
The feature/improvement we'd like some assistance with is: Throttling for cleanup policies for container tags
The epic and relevant issues are: gitlab-org/gitlab#208193 (closed)
The reason we're asking for a scaling review on this item is:
Cleanup policies for container tags are currently enabled only for new projects (12.7+). We have an epic (gitlab-org&2270 (closed)) which aims to have them enabled for all projects (including those created before 12.7).
The particular issue with this cleanup is that the actual list of tags lives only in the container registry: given a project, we don't know in advance how many tags we're dealing with. Because of this, enabling cleanup policies on old projects can generate a lot of jobs that will be really slow (that many tags to delete).
To prevent the Sidekiq queue from backing up, we would need to impose throttling on the queue and add some application limits throughout the cleanup services. An analysis has been made here: gitlab-org/gitlab#208193 (comment 362910703), and we're suggesting using a scheduler worker: a parent worker that monitors what is happening with the underlying workers (similar to https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/workers/geo/scheduler/scheduler_worker.rb#L24-31).
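To make the idea more concrete, here is a minimal Sidekiq sketch of such a parent/scheduler worker. It only illustrates the pattern, not the actual implementation: the class names, the child worker, the `pending_project_ids` / `running_cleanup_jobs_count` helpers and the numeric limits are all placeholder assumptions, and the real workers would follow GitLab's worker conventions like the Geo scheduler linked above.

```ruby
require 'sidekiq'

# Hypothetical child worker that would delete the expired tags of one project.
class ContainerExpirationPolicyWorker
  include Sidekiq::Worker

  def perform(project_id)
    # talk to the container registry and delete the tags matched by the policy...
  end
end

# Sketch of the parent scheduler: it keeps at most MAX_CAPACITY cleanup jobs
# in flight and stops once its runtime budget is exhausted, re-running later
# (e.g. via cron) to pick up the remaining work.
class ContainerExpirationPolicyScheduler
  include Sidekiq::Worker

  MAX_CAPACITY        = 4  # assumed limit: cleanup jobs in flight at once
  MAX_RUNTIME_SECONDS = 60 # assumed limit: how long one scheduler run may last

  def perform
    started_at = Time.now

    while Time.now - started_at < MAX_RUNTIME_SECONDS
      pending = pending_project_ids
      break if pending.empty?

      free_slots = MAX_CAPACITY - running_cleanup_jobs_count

      if free_slots.positive?
        pending.first(free_slots).each do |project_id|
          ContainerExpirationPolicyWorker.perform_async(project_id)
        end
      end

      sleep 5 # wait before re-checking the available capacity
    end
  end

  private

  def running_cleanup_jobs_count
    # Placeholder: in practice this would be tracked in Redis or derived from
    # Sidekiq's in-flight job data.
    0
  end

  def pending_project_ids
    # Placeholder: projects whose cleanup policy is due to run.
    []
  end
end
```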
We don't want to "open the gates" right from the start. We are suggesting implementing some limits and opening the cleanup policies to a few selected projects, so that we can monitor how the system reacts. See gitlab-org/gitlab#208193 (comment 369934345).
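One way to do that limited rollout (sketch only; the flag name below is a hypothetical example) is a per-project feature flag, since `Feature.enabled?` accepts an actor such as a project:

```ruby
# Sketch: gate the throttled cleanup behind a per-project feature flag so it
# can be turned on for a handful of selected projects first. The flag name is
# a hypothetical example.
def throttled_cleanup_enabled?(project)
  Feature.enabled?(:container_expiration_policies_throttling, project)
end
```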
One thing that is hard to pinpoint is which limits (the actual numbers) we should use. From the analysis above, we identified three limits:
- How many tags can the service interacting with the registry delete at once?
- How many jobs can we enqueue at most? (i.e. the number of slots)
- What is the max runtime for the scheduler worker?
I suggested some numbers in gitlab-org/gitlab#208193 (comment 363960754).
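For illustration, these three limits could be expressed along the following lines. The names and values are purely placeholders, not a recommendation; in practice they would more likely live in application settings or plan limits so they can be tuned without a deploy.

```ruby
# Illustrative only: hypothetical names and values for the three limits.
module ContainerExpirationPolicies
  MAX_TAGS_DELETED_PER_RUN    = 100 # tags the registry-facing delete service removes per call
  MAX_CONCURRENT_CLEANUP_JOBS = 4   # "slots": cleanup jobs allowed to run at once
  MAX_SCHEDULER_RUNTIME       = 60  # seconds one scheduler run may last
end
```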
In particular, we are concerned about:
- Memory
- Migrations
- N+1 queries
- Queueing
- Design implementation
- Other...
We're hoping to release this as part of milestone: the issue currently targets %13.2, but with the Package team being at capacity, it will probably slip past %13.2.