When we turn on the policy feature for all projects, we need a way to prevent the Sidekiq queue from backing up and grinding things to a halt, especially when a policy has a large number of tags to clean up on its first run. To prevent thousands of tags from being queued by one project, we will add throttling to the queue or service.
Permissions and Security
There are no permissions changes required for this change.
Documentation
Add a section to the Container Registry admin docs explaining that jobs will be throttled and how to troubleshoot when something goes wrong.
Availability & Testing
What does success look like, and how can we measure that?
We are able to enable container expiration and retention policies for all projects on GitLab.com and self-managed instances regardless of which container registry is being used.
To me, at least, this is a very unknown area, so I am going to weight this as 3 for now. I think spending time working through the example MR in the description will help understand how big of a change this will be for this issue. @10io, @ggelatti, if you have any experience in this type of area, please feel free to revise the weight (and add any details).
We should use this issue to investigate and implement. If the initial investigation doesn't yield deliverable results, we can break it up for future milestones.
Sorry for the ping. I see it's: #196124 (closed) ; added this issue as a blocker. Please feel free to change it to "related" instead if that's not appropriate.
Started working on this. I poked people in Slack about what a good solution would be, and it seems it will be more complex than expected.
I started analyzing the current state and how we can avoid a very large and slow queue. I will report back when I have some results/questions.
Given the complexity of this part and how we can implement throttling, I'd like to lay out the current situation and how we can apply some Application Limits so that we don't overflow the queue with slow jobs.
Status
Deployed
Current Situation
```mermaid
graph TD
  cepw([ContainerExpirationPolicyWorker<br />Cron job running each 50 minutes<br />Loop on runnable policies])
  pcr[[API::ProjectContainerRepositories]]
  ceps[ContainerExpirationPolicyService<br />Schedule next run<br />Loop on container_repositories]
  ccrw([CleanupContainerRepositoryWorker])
  cts[CleanupTagsService<br />Validate params<br />Validate permissions<br />Apply filters on tags to delete]
  dts[DeleteTagsService<br />Loop on tags<br />Call #fast_delete or #slow_delete]
  prtc[[Projects::Registry::TagsController]]

  cepw-- container expiration policy -->ceps
  pcr-- container_repository_id, params -->ccrw
  ceps-- container_repository_id, params -->ccrw
  ccrw-- container_repository_id, params -->cts
  cts-- container_repository, tags -->dts
  prtc-- container_repository, tags -->dts
```
Here are my notes for each piece (leaving out less important details, such as the validations):
ContainerExpirationPolicyWorker
A cron job that is executed every 50 minutes. Its main job is to fetch all the runnable container expiration policies and execute them.
Note that the number of runnable container expiration policies is quite variable, as it depends on the policy's next_run_at attribute, which is set by the container expiration policy cadence.
ContainerExpirationPolicyService
A service that will do two main things on the given container expiration policy:
schedules the next run for the policy
loops on each associated container_repository and enqueues a CleanupContainerRepositoryWorker job with the params derived from the policy.
Note that the number of container_repositories is unbounded.
API::ProjectContainerRepositories
Similar to ContainerExpirationPolicyService, it will enqueue a CleanupContainerRepositoryWorker job with the (user-sourced) params.
There is a lease of 1h here. That means that for a given container_repository a user can do 1 request per hour max.
Projects::Registry::TagsController
That's the UI controller that drives the container registry pages. It has a bulk destroy action that will call the DeleteTagsService directly.
There is a strong limit: only 15 tags max can be selected for bulk destruction.
CleanupContainerRepositoryWorker
The central worker that will do the cleaning.
It's important to note that this worker doesn't know anything about how much work there is to do.
CleanupTagsService
This service does 3 main things:
check the params
check the user permissions
get the tags from the registry and apply the given filters on them to get a list of tags to delete.
Once there are tags to delete, they are passed to DeleteTagsService.
The list of tags from a given container_repository is unbounded.
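The filtering step is what decides how much work DeleteTagsService receives. As a rough, self-contained illustration (plain Ruby structs stand in for the real registry client and policy objects; tags_to_delete and the parameter names are made up for this sketch), the filters compose roughly like this:

```ruby
Tag = Struct.new(:name, :created_at, keyword_init: true)

def tags_to_delete(tags, name_regex:, keep_n:, older_than:)
  candidates = tags.select { |t| t.name.match?(name_regex) }     # keep only matching names
  candidates = candidates.sort_by(&:created_at).reverse          # newest first
  candidates = candidates.drop(keep_n)                           # retain the keep_n most recent
  candidates.select { |t| t.created_at < Time.now - older_than } # only tags old enough
end

tags = [
  Tag.new(name: 'v1.0',   created_at: Time.now - 200 * 86_400),
  Tag.new(name: 'v1.1',   created_at: Time.now - 10 * 86_400),
  Tag.new(name: 'latest', created_at: Time.now)
]

puts tags_to_delete(tags, name_regex: /\Av\d/, keep_n: 1, older_than: 90 * 86_400).map(&:name)
# => v1.0
```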
DeleteTagsService
This service deletes tags in two different ways depending on what type of container registry is running:
fast_delete
slow_delete
.com is running on the #fast_delete side as it is using the GitLab Container Registry.
The problem
The main issue is that CleanupContainerRepositoryWorker can be quick or slow to execute depending on the number of tags to delete. We can't know this in advance without contacting the container registry (a slow operation for now). In short, given a container_repository, we have no means of knowing in advance how many tags there are to delete.
Currently, container expiration policies are automatically created for new projects since 12.7. Projects created before 12.7 don't have a policy and the UI to create them is blocked. This block is behind an application setting: container_expiration_policies_enable_historic_entries.
By enabling container expiration policies on these pre-12.7 projects, we can potentially fill the queue with slow CleanupContainerRepositoryWorker jobs (eg. jobs with many, many tags to delete) and block the whole queue. That is why this issue has been created: to introduce Application Limits in the form of throttling so that this doesn't happen.
Rough estimates for gitlab.com:
We have ~720K container_repositories not connected to a container expiration policy.
Possible solution
First of all, I'd like to point out that the MR referenced in this issue description (gitlab-foss!7292 (merged)) has been reverted (gitlab-foss#51509 (closed)) due to errors and wrong behaviors. In short, there is no "automatic" / implicit throttling support for jobs at the moment and it has to be done on a case-by-case basis.
Second, since we're going to update / modify workers, we should be aware that those workers have a strong execution time limit: 300s.
As for the solution, we can leverage one fact: container expiration policies are executed more than once, so we don't need to delete all tags at once. The idea is to have a set of limits for jobs and, when one of them is hit, stop the current tag destruction.
Being the last piece interacting with the container registry, DeleteTagsService is one of the few components that knows how much work there is to do. It has all the necessary knowledge to impose a limit. We could even use different limits for the two registry types, but for the first iteration we're going to focus on #fast_delete.
When hitting the limit, this service will simply not consider any further tags. This limit should be implemented as a maximum execution time, kept below the 300s worker limit.
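A minimal sketch of what such a time budget could look like, assuming a delete_tag callable that issues one registry request per tag. The 100s default mirrors the delete_tags_service_timeout suggested below; the method and return shape are placeholders, not the actual service:

```ruby
# Sketch only: stop considering further tags once the time budget is spent.
def fast_delete(tags, timeout: 100, &delete_tag)
  start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  deleted = []

  tags.each do |tag|
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
    break if elapsed > timeout # limit hit: leave the remaining tags for the next run

    deleted << tag if delete_tag.call(tag) # one registry request per tag
  end

  { status: deleted.size == tags.size ? :success : :partial, deleted: deleted }
end

# fast_delete(%w[v1 v2 v3], timeout: 100) { |tag| true }
# => { status: :success, deleted: ["v1", "v2", "v3"] }
```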
Refactor ContainerExpirationPolicyWorker
This worker can orchestrate the underlying worker (CleanupContainerRepositoryWorker) and throttle how much of these are enqueued.
To orchestrate this, this worker will use different limits, in particular it will work with a number of "slots" available for CleanupContainerRepositoryWorker jobs. The overall logic will be:
Check if any CleanupContainerRepositoryWorker jobs are still running from a previous enqueue.
Compute how many CleanupContainerRepositoryWorker jobs should be enqueued.
Enqueue them in a reasonable way.
Re-enqueue itself for execution in the near future.
We basically have a loop that checks the ongoing CleanupContainerRepositoryWorker work and feeds more work as slots become available. Since having a long-running job is not possible, we simply re-enqueue the job to execute again in the near future. This is essentially a pause of X seconds between loop iterations.
Technical details
The ContainerExpirationPolicyWorker will follow this code flow:
ContainerExpirationPolicyWorker takes two arguments:
started_at
cache_key
When enqueued by cron, these two arguments will be empty/not used
When self enqueued, they will be filled.
The execution goes like this:
If started_at is given, check that we're not above max_run_time. If we are, return.
Compute the slots_available. Use the length of the cache_key array as it's giving the ongoing/pending jobs.
Take max_slots or slots_available container_repositories on which the policy should be executed. We will probably need to order things here using the next_run_at of the policy. This is to ensure that we deal with policies that have the biggest [Time.zone.now - policy.next_run_at] difference first.
In slices of batch_count, enqueue CleanupContainerRepositoryWorker jobs spreading the load using batch_backoff_period. Using the loop as shown in gitlab-com/gl-infra/scalability#461 (comment 372001300). Take note of the job ids in job_ids.
If cache_key was not given, create a new one with the job_ids array.
Re-enqueue ContainerExpirationPolicyWorker to be executed in backoff_period with started_at and cache_key.
In addition, CleanupContainerRepositoryWorker will need to be updated to:
Accept an optional cache_key
When ending or failing, if cache_key was given, remove the job_id from it.
Use a lease on the container_repository.id to ensure that two sidekiq threads will not clean the same container_repository.id at the same time.
If possible, try to generalize the whole logic into abstract classes so that it can be re-used easily elsewhere. A rough Ruby sketch of the throttled worker follows.
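The sketch below follows the flow above, assuming plain Sidekiq. The constants mirror the suggested limits in the next section, and the private helpers are stubs standing in for the Redis/job-tracking state and the policy queries described above; none of this is the final implementation:

```ruby
require 'sidekiq'

class ThrottledContainerExpirationPolicyWorker
  include Sidekiq::Worker

  MAX_SLOTS            = 100     # max CleanupContainerRepositoryWorker jobs in flight
  BATCH_COUNT          = 10      # jobs scheduled for the same time slice
  BATCH_BACKOFF_PERIOD = 25      # seconds between batches
  BACKOFF_PERIOD       = 25      # seconds before this worker re-enqueues itself
  MAX_REENQUEUE_TIME   = 30 * 60 # stop re-enqueueing after 30 minutes

  def perform(started_at = nil, cache_key = nil)
    started_at ||= Time.now.to_i
    return if Time.now.to_i - started_at > MAX_REENQUEUE_TIME

    slots = MAX_SLOTS - pending_job_ids(cache_key).size # ongoing jobs occupy slots

    job_ids = repositories_to_clean(limit: slots).each_slice(BATCH_COUNT).with_index.flat_map do |batch, index|
      batch.map do |repository_id|
        # spread the batches over time instead of enqueueing everything at once
        CleanupContainerRepositoryWorker.perform_in(index * BATCH_BACKOFF_PERIOD, repository_id, cache_key)
      end
    end

    track_job_ids(cache_key, job_ids)
    self.class.perform_in(BACKOFF_PERIOD, started_at, cache_key) # self re-enqueue
  end

  private

  # Stubs: the real implementation would read Redis (job ids) and the database (policies).
  def pending_job_ids(_cache_key) = []
  def repositories_to_clean(limit:) = []
  def track_job_ids(_cache_key, _job_ids); end
end
```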
Suggested limits
Names are not final.
| Parameter | Suggested starting value | Explanation |
| --- | --- | --- |
| max_slots | 100 | The max number of CleanupContainerRepositoryWorker jobs enqueued by ContainerExpirationPolicyWorker at any given time. |
| batch_count | 10 | The number of CleanupContainerRepositoryWorker jobs scheduled for the same timestamp or time slice. |
| batch_backoff_period | 25s | Batches of CleanupContainerRepositoryWorker jobs will be scheduled for times separated by this wait period. |
| backoff_period | 25s | The wait period before ContainerExpirationPolicyWorker re-enqueues itself. |
| max_reenqueue_time | 30min | The period during which ContainerExpirationPolicyWorker will re-enqueue itself. |
| delete_tags_service_timeout | 100s | The max run time for DeleteTagsService#execute when using DeleteTagsService#fast_delete. |
These limits will be implemented as Application settings so that we can easily tweak them as we go along. They could be exposed in the UI.
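For illustration, a hedged sketch of how a couple of these could land as application settings (the column names, defaults, and migration version are illustrative only, not the final schema), so the values can be tweaked at runtime:

```ruby
# Illustrative migration only; column names and defaults are not final.
class AddContainerExpirationPolicyLimits < ActiveRecord::Migration[6.0]
  def change
    add_column :application_settings, :container_registry_expiration_policies_max_slots,
               :integer, default: 100, null: false
    add_column :application_settings, :container_registry_delete_tags_service_timeout,
               :integer, default: 100, null: false # seconds
  end
end
```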
Benefits
Both workers are running within the 300s limit.
The load on sidekiq is spread and as a side effect, the load on the container registry will also be spread.
The number of CleanupContainerRepositoryWorker enqueued by ContainerExpirationPolicyWorker is limited at any given time.
We enforce deduplication on CleanupContainerRepositoryWorker jobs, meaning that at any given time we will not have two enqueued jobs with the exact same params
By using execution time limits, both workers are independent of the throughput of the Container Registry on those delete requests. If the Container Registry has a temporary slowdown on API requests, it will affect the throughput of DeleteTagsService (tags_deleted/min), but the limits will still be properly applied.
Deployment
We want to deploy this feature in small steps. To do so, we're going to rely on several feature flags and application settings to control how both workers behave (a sketch of the flag gating follows the list):
A first feature flag that would be the behavior change on the ContainerExpirationPolicyWorker. This feature flag would allow us to switch between: no throttling (current implementation) <-> throttling (the implementation detailed above)
The limits used during the throttling mode have to be in the Application Settings so that we can fine tune them quickly without a code deployment (eg. no hard coded limits)
A second feature flag that would be one that is scoped by project and that allows creating a container expiration policy on a project that is older than 12.7. This way, we can selectively allow which pre 12.7 projects will have a container expiration policy and move forward with smaller steps (eg. including more and more projects while keeping an eye on logs).
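As a sketch of the gating, assuming GitLab's Feature.enabled? helper; the flag names are placeholders and the two perform_* methods are stubs for the behaviors described above:

```ruby
require 'sidekiq'

class ContainerExpirationPolicyWorker
  include Sidekiq::Worker

  def perform
    if Feature.enabled?(:container_expiration_policies_throttling) # placeholder flag name
      perform_throttled   # new orchestration with limits (see the sketch above)
    else
      perform_unthrottled # current implementation, kept as a safety net
    end
  end

  private

  def perform_throttled; end   # stub
  def perform_unthrottled; end # stub
end

# The second, project-scoped flag would gate policy creation on pre-12.7 projects:
# Feature.enabled?(:container_expiration_policies_historic_entry, project)
```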
Side effects
By adding a limit to DeleteTagsService, this service can do partial work: part of the tags are deleted, but not all of them.
This partial work can be seen/displayed in the UI. Example: a user has a project with a single container_repository that has 10000 tags. They want to remove all of them except the 5 most recent ones. They call the bulk_delete API but, due to the limit (say 1000), around 9000 tags remain and are displayed in the UI.
We can mitigate this situation by documenting this limit properly.
The same will happen when enabling all expiration policies. We can imagine that the first runs of ContainerExpirationPolicyWorker will be packed with jobs to do = we will not be able to execute all of them within the max_run_time = It is guaranteed that some container_repositories linked to runnable policies will stay unprocessed even though the policy is executed.
From https://log.gprd.gitlab.net/goto/1c1b64d8c97b6fb950d56f45f95bddda (internal link), we can see that Projects::Registry::TagsController#destroy execution time is around 0.5s for 50th percentile. We can use that for a baseline for DeleteTagsService#fast_delete execution time although it should be below that in reality.
Note that I didn't take into account the API::ProjectContainerRepositories access. This one is under a lease. I think we can focus solely on the ContainerExpirationPolicyWorker, as this is the one that will generate the bulk of CleanupContainerRepositoryWorker jobs.
MRs 1, 2 and 3 are the absolute minimal implementation needed for this change. One note on the weight of MR 3: 3 is a pessimistic view of the work that has to be done. Our discussions with the scalability team really helped to pinpoint the technical details of this MR, but we can still get a major surprise.
@trizzi What is the deployment plan of the application setting container_expiration_policies_enable_historic_entries? In other words, how do we enable container expiration policies globally? I see two ways:
We switch the application setting on and let users create container expiration policies for their pre 12.7 projects
We switch the application setting on and we seed a default container expiration policy on all these pre 12.7 projects
@jdrpereira For DeleteTagsService#fast_delete, could we imagine that DELETE /v2/#{name}/tags/references takes a list of tags and delete them all in one single API request? We could limit the number of tags passed but the idea is that for n tags, we would call the api n/max_tags_count times.
Thanks
To @reprazent, @tkuah and @brodock for their help while brainstorming on a solution for this issue.
The side effects seem reasonable to me in order to enable cleaning up so many tags that are currently just sitting there.
I think it'll be important to inform the users as best we can in the UI when incomplete deletions exist. Based on our Slack conversation, I put together two rough sketches of a way we might do that:
Image Repository List View Notification
This design includes an icon w/ tooltip stating the image repository has an incomplete tag deletion job
This comp includes a high-level alert that tells the user more information and includes a link to documentation
In an ideal world, we would be able to state which tags specifically are scheduled for deletion. Given that this isn't technically feasible right now, a more general alert would inform the user.
For DeleteTagsService#fast_delete, could we imagine that DELETE /v2/#{name}/tags/references takes a list of tags and delete them all in one single API request? We could limit the number of tags passed but the idea is that for n tags, we would call the api n/max_tags_count times.
@10io that's technically possible and certainly a welcomed improvement, but it would require a new route. Given that we decided to temporarily freeze the development of new Container Registry features, I would leave that for a future iteration, unless it becomes a real issue/bottleneck, in which case we can reevaluate. We have to continue supporting individual tag deletion for third-party registries for now, so this seems feasible regardless.
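For reference, the batching idea from the question amounts to something like this; delete_tags_request is hypothetical (no such bulk route exists today), and the cap value is just an example:

```ruby
# For n tags and a per-request cap, this issues ceil(n / max_tags_count) requests.
def delete_in_batches(tags, max_tags_count: 100, &delete_tags_request)
  tags.each_slice(max_tags_count) do |batch|
    delete_tags_request.call(batch) # one hypothetical bulk API request per batch
  end
end

# delete_in_batches((1..250).map { |i| "tag-#{i}" }) { |batch| puts batch.size }
# => 100, 100, 50 (three requests)
```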
Another great write-up @10io! Just to make sure I understand clearly..
The tradeoff to adding hard limits will be that it may take a long time for an expiration policy to finish running and we'll handle that by notifying them in the UI with an icon and by adding a note about this in the documentation.
Separately, we can't simply turn this on for all projects. We'll have to roll this out in some logical order.
I'm OK with this plan. @icamacho one thing I'd like to avoid is having a warning that the user can not clear. But, the comps look good to me.
What is the deployment plan of the application setting
There are two parts to this. First, how do we roll the feature out in a scalable manner? Second, should we seed a default cleanup policy for each project as we roll it out?
For the former, I know @jdrpereira mentioned rolling it out based on the date of the project's creation. What do you think about rolling the feature out by namespace? This would allow us to choose a few alpha/beta customers to test with and monitor the results. We could start with one company, then 5, then 10, etc., until we feel that we are ready for a broader rollout. (Maybe we could even combine approaches)
I'm curious, I know you mentioned there is no standard GitLab way for throttling, but have other teams rolled out features in some logical order? @joshlambert did we phase the Puma rollout?
With regards to seeding a default cleanup policy.. I think we should do this. Right now the default settings are fairly conservative:
The expiration interval is set to 90 days
The expiration schedule is set to run weekly
The number of tags to retain is set to 10.
All tags matching the above criteria are set for expiration
The tradeoff to adding hard limits will be that it may take a long time for an expiration policy to finish running and we'll handle that by notifying them in the UI with an icon and by adding a note about this in the documentation.
It's more that when an expiration policy takes a long time to run, we will need to stop enqueuing jobs and the work will resume the next time the container expiration policy is executed. For example, let's say that, after evaluation, a policy has to delete 30K tags across different container_repositories and we have a 30min limit on the runtime. We let the workers do their job and after 30min they have deleted 25K. We will not enqueue any more jobs and there will be 5K tags left to delete.
As such, a container_repository can be in one of these states after its policy has run:
No tags were deleted. This can happen when there is so much work at once that some policies will not enqueue any delete job.
Part of the tags were deleted. That's the example above. This is where the frontend would show an indication that Hey, due to volume, the container expiration policy was partially executed.
All the tags were deleted <- that's the ideal, perfect case.
Given the amount of work we will receive when enabling the policies globally, we will encounter (1.) and (2.)
For the former, I know @jdrpereira mentioned rolling it out based on the date of the project's creation. What do you think about rolling the feature out by namespace? This would allow us to choose a few alpha/beta customers to test with and monitor the results. We could start with one company, then 5, then 10, etc., until we feel that we are ready for a broader rollout. (Maybe we could even combine approaches)
That can work too. Given the numbers I tried to get, we're in for a massive amount of jobs. As such, we should try to distribute this "load" across different weeks / milestones. Since this aspect is more about rolling out the policies globally, can we move this discussion to #196124 (closed)?
With regards to seeding a default cleanup policy.. I think we should do this. Right now the default settings are fairly conservative:
I agree, but from @jdrpereira's comment (#196124 (comment 363566582)), some projects might not want to have a container expiration policy enabled. Again, that's more of a rollout aspect.
I'm curious, I know you mentioned there is no standard GitLab way for throttling, but have other teams rolled out features in some logical order?
By throttling, I'm talking about the worker queue management. From what I gather, there is nothing in sidekiq or in GitLab to handle it directly. The thing that is the closest is a scheduler worker = a parent worker that will enqueue jobs for children workers and keep an eye on them so that the number of jobs enqueued is under control at all times.
I'm OK with this plan. @icamacho one thing I'd like to avoid is having a warning that the user can not clear. But, the comps look good to me.
@trizzi I also want to avoid annoying messages users can't get rid of. If we allow a user to clear the message saying that deletions are in progress, we currently don't have a way to inform them that the status of their registry has changed (i.e. the deletion has completed).
If we let users dismiss the alert, it might be beneficial to add a green success alert saying that the pending deletion job has completed. The user would also be able to dismiss this alert.
Given the scale of time @10io has discussed around deleting, I think it is important that we inform the user that the image repository has changed state from "mid delete process" to "completed delete process". However, I might be going overboard. Any thoughts around this?
This did not work for Global Search since we needed extra logic to kick off backfills at the same time as enabling the feature so we implemented a data model called ElasticsearchIndexedNamespace to track the namespaces that are enabled and implemented ElasticNamespaceRolloutWorker which is invoked by an API to rollout to percentages of groups. Our stuff is quite custom based on the needs of the Elasticsearch architecture so I'm not sure how well it generalises to your problem.
It's worth noting that for us the concept of "percentage enabled" is only a facade and really we need to persist the exact groups that have been enabled since percentages are relative to the overall data set of groups which is a moving target and since the data backfill has to be up to date for the group to use Elasticsearch we need to persist the groups explicitly.
@tkuah I think, from discussions with @trizzi, we want to move in small steps. The best way for us, given the amount of work to do for workers, would be to selectively include projects to have container expiration policies.
This way, we have a better control on how many "new" policies are created and used by the cleanup workers. We already have some customers interested in trying these policies, these could be very well our first projects to be included to have a cleanup policy.
At a later time, we can slowly "open the gates" by using a %-based feature flag.
@tkuah I'd like to make crystal clear two terms: container_repository and tag. Since the analysis above is deeply technical, I'm re-using the terms I see in the codebase.
Having said that here is which term is what on the container registry pages. When you open the container registry page, you get a list of images:
If you click on an image, you get its tags:
So basically: a Project has x container_repositories and a container_repository has y tags.
Now, these two objects are not persisted in the same way:
container_repository: persisted in the database and known by rails without accessing the container registry.
tag: not persisted in the database and thus not known by rails without accessing the container registry.
In short, on the rails side we can know x but not y. To know y, rails has to contact the container registry.
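To make the contrast concrete, a quick Rails-console style illustration (the project path is hypothetical; this assumes the GitLab codebase's own associations rather than standalone code): counting repositories is a plain database query, while listing tags triggers a registry round trip.

```ruby
project = Project.find_by_full_path('some-group/some-project') # hypothetical path

project.container_repositories.count      # x: answered by the database alone
project.container_repositories.first.tags # y: fetched live from the container registry
```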
Now that the two terms are defined, let's dig into the numbers. Following a discussion with @nmezzopera, projects from before 12.7 currently don't have a container expiration policy. When we enable the policies globally, all these projects will potentially have a freshly created policy that needs to be executed.
I ran a rough count (this time, with the right conditions) and we have ~420K container_repositories that don't have an expiration policy (and thus are connected to a project prior 12.7)
Do we have an idea of how many expired images are under the estimated 100K container_repositories with a disabled policy?
I think your question is how many tags we have under those 420K container_repositories, and the answer is: we can't know in advance. As stated above, we don't persist the tags on the rails side; only the container registry is able to know this number. I'm not sure if there is a way to get an estimate from the container registry.
@jdrpereira / @hswimelar Do we have some stats for .com on container registry tags that we could take here into account? Such as: the average tags count per repository, the highest tags count?
Do we have some stats for .com on container registry tags that we could take here into account? Such as: the average tags count per repository, the highest tags count?
Unfortunately no. We extracted similar statistics for the dev registry in the past (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9033#note_295613525), which can give us a rough idea, but it's not feasible to do the same analysis for the production registry due to technical limitations (because of its size and the time it would take to scan).
I think your question is how many tags we have under those 420K container_repositories, and the answer is: we can't know in advance. As stated above, we don't persist the tags on the rails side; only the container registry is able to know this number. I'm not sure if there is a way to get an estimate from the container registry.
Even if we knew how many tags exist under those repositories, that would only serve as a highly pessimistic estimate (better than nothing though), as the expiration policies rely on user defined regex to tell which tags are to be deleted.
Unfortunately no. We extracted similar statistics for the dev registry in the past (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9033#note_295613525), which can give us a rough idea, but it's not feasible to do the same analysis for the production registry due to technical limitations (because of its size and the time it would take to scan).
Wow, am I reading this correctly? Almost 10K tags for the highest count ?
Perhaps the takeaway here is that the tags count distribution goes more towards a (relatively) low tags count. This helps us to pinpoint a limit for DeleteTagsService
Even if we knew how many tags exist under those repositories, that would only serve as a highly pessimistic estimate (better than nothing though), as the expiration policies rely on user defined regex to tell which tags are to be deleted.
Yes, I guess we have no choice other than looking at the worst case scenarios. In this case, I'm trying to base my view on the default policy: no regex, keep_n 10, older_than 90d and cadence 1d. That is not super realistic but we have to start somewhere right?
I ran a rough count (this time, with the right conditions) and we have ~420K container_repositories that don't have an expiration policy (and thus are connected to a project prior 12.7)
That's a lot. I'm a bit worried about it, as we would be opening the doors to all those potentially "dirty" repositories (with a high number of tags to be expired) simultaneously. This would put the registry (and the GitLab workers) under an extreme load.
We might need to consider approaching the rollout of the expiration policy for all gitlab.com projects in batches (e.g. for all projects since 12.5, then all projects since 12.3, etc., until there is nothing left). I'll take this discussion to #196124 (closed).
Wow, am I reading this correctly? Almost 10K tags for the highest count ?
Indeed. We can count with some odd distributions in production as well.
Yes, I guess we have no choice other than looking at the worst case scenarios. In this case, I'm trying to base my view on the default policy: no regex, keep_n 10, older_than 90d and cadence 1d. That is not super realistic but we have to start somewhere right?
Yes, I don't think we can do much better. I think it would be really useful to improve logging for better observability around the expiration policies. If we opt for a phased rollout, we can collect statistics (e.g. batch count, tag count per batch, batch processing time) between each phase and adapt the approach if/as needed.
That's a lot. I'm a bit worried about it, as we would be opening the doors to all those potentially "dirty" repositories (with a high number of tags to be expired) simultaneously. This would put the registry (and the GitLab workers) under an extreme load.
That's why we have to put some Application Limits in place to control the load. By having a root worker orchestrating the underlying workers, we can fine control how many workers can "hammer" the registry and for how much time.
I still do think that DeleteTagsService should have a proper limit to ensure that we don't overload the registry with requests.
We might need to consider approaching the rollout of the expiration policy for all gitlab.com projects in batches (e.g. for all projects since 12.5, then all projects since 12.3, etc., until there is nothing left). I'll take this discussion to #196124 (closed).
What's curious is that the page loads tag counts pretty quickly, certainly faster than 0.5 seconds per tag, which would add up to just shy of 14 hours.
Isn't the list loaded in one request to the registry and the rails backend just gets the size of it?
If that's the case, there is room for improvement right there: have an API endpoint just returning the count instead of having the backend getting the whole list. Well, thoughts for future iterations I guess
Isn't the list loaded in one request to the registry and the rails backend just gets the size of it?
Yes.
If that's the case, there is room for improvement right there: have an API endpoint just returning the count instead of having the backend getting the whole list. Well, thoughts for future iterations I guess
As of now (filesystem-based metadata), counting tags or listing them takes roughly the same time in the container registry side (it requires a "list" operation on the repository "folder" regardless). Therefore, the only benefit would be to decrease the response payload, which would lead to negligible gains.
Once we have a metadata database this will be different (one more feature in the queue).
From https://log.gprd.gitlab.net/goto/1c1b64d8c97b6fb950d56f45f95bddda (internal link), we can see that Projects::Registry::TagsController#destroy execution time is around 0.5s for 50th percentile. We can use that for a baseline for DeleteTagsService#fast_delete execution time although it should be below that in reality.
we put these parameters in the application settings (to have a way to modify them at runtime)
max_tags_count for DeleteTagsService
max_run_time for ContainerExpirationPolicyWorker
max_capacity for ContainerExpirationPolicyWorker
we deploy it and monitor the situation with the logs (see "Other considerations") from workers + the load on the container registry.
once the situation is stable, we move to the next step in the plan for #196124 (closed)
Before starting the implementation, we should state what values we will use for those parameters. In #208193 (comment 363960754), I suggested some values.
These limits will also be applied to self-managed instances; is that an issue? One thing to note here: self-managed instances can be using the #slow_delete method, which is slower than #fast_delete.
As I said above in "Future iterations", we might want to expose the parameters in the UI so that admins can configure them depending on their architecture. Perhaps we should have this right from the start.
We definitely need better logging and metrics around the execution of the cleanup policies. Currently, it's very difficult (if not impossible in some cases) to debug issues. It's better to deal with that before opening the doors to potentially many more of them (even if most are false alarms, there is a limit for what we can triage/debug).
Regarding the phased rollout, I like what Tim proposed, doing it by namespace instead of creation date. We'll likely want to have more control over which repositories can be candidates for cleanup on the first stages.
Additionally, we can also consider starting by limiting the execution for repositories that have less than a specific number of tags to be deleted. Such might sound counter-productive, as we would be avoiding the biggest repositories (which are the ones we want to clear the most). Still, if we can get all the small/medium ones cleaned up for the first time (and more importantly, keep cleaning them regularly and much quicker after that), we'll then have headroom and confidence to tackle the biggest ones.
Regarding self-managed instances vs gitlab.com, AFAIK, there is no hard limit on which repositories can be cleaned for self-managed instances (i.e. the policy can be enabled on projects created before 12.8). Is that right @trizzi? If so, we should probably focus on gitlab.com and the self-managed instances using the GitLab Container Registry for this and therefore not worry about the "slow delete" mode (which makes our problems 8 times worse, as it's 8 times slower).
We could use a feature flag scoped by project, which enables us to apply the above behavior (scheduler worker) only on selected projects.
It's a bit more work upfront as we have to keep the current ContainerExpirationPolicyWorker implementation and have a new one at the same time (the feature flag will choose the correct one). I still think it's worth it: it allows us to take small steps with this in production which, given the amount of tags to delete, is not a bad thing at all.
After some thoughts, we can't use a single feature flag to gate this feature. The main reason is that the ContainerExpirationPolicyWorker needs to be changed for all policies at once. We can't have a mode for certain policies and a different mode for other policies. Thus, I suggest having several knobs/switches to control how this worker is handling things:
A first feature flag that would be the behavior change on the ContainerExpirationPolicyWorker. This feature flag would allow us to switch between: no throttling (current implementation) <-> throttling (applying some limits and have it long running so that it can monitor what's happening)
The limits used during the throttling mode have to be in the Application Settings so that we can fine tune them quickly without a code deployment (eg. no hard coded limits)
A second feature flag that would be one that is scoped by project and that allows creating a container expiration policy on a project that is older than 12.7. This way, we can selectively allow which pre 12.7 projects will have a container expiration policy and move forward with smaller steps (eg. including more and more projects while keeping an eye on logs).
By having two feature flags we can "revert" to the current implementation at will. I think we can start with the numbers stated in the scenario in #208193 (comment 363960754) and start observing how the system reacts.
As there is no change in the permissions and CleanupTagsService validates the permissions, I'm marking this as reviewed with no-action required from appsec.
Refactored the analysis comment to add a deployment section. Pinged the scalability team on gitlab-com/gl-infra/scalability#461 and asked about having a long running worker. Current SLO for ContainerExpirationPolicyWorker is 300sec / 5min. We're not going very far with that
Started working on limiting the execution time for DeleteTagsService.
We're having an interesting discussion in the scalability issue about the approach for the ContainerExpirationPolicy worker with a good alternative to a scheduler worker: gitlab-com/gl-infra/scalability#461 (comment 372001300).
MR
!35539 (merged) - Add logging to DeleteTagsService - in review
I still don't get why this limitation is enabled for on-premise installations. I think each on-premise installation administrator must be able to decide on that.
@glensc Self-managed instances are able to turn the feature on for all projects by updating their app's settings.
If you set container_expiration_policies_enable_historic_entries to true, existing projects will have the ability to have policies set. https://docs.gitlab.com/ee/api/settings.html
Updated the analysis above with the last discussion results.
Opened MR 1 Add a execution time limit to DeleteTagsService. While implementing this, I realized that it would be beneficial to better organize the code around DeleteTagsService. I opened a second MR for just that.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in dev
!36337 (merged) - Splitting DeleteTagsService in smaller services - in dev
Added more rspec examples for MR1 and prepared the MR for the review. I expect to be ready for review tomorrow.
I investigated how many times CleanupContainerRepositoryWorker has a duration > 300s (this will trigger an alert for the production team). Those cases would be partially executed if MR1 was merged. Here is the corresponding dashboard:
MR 2 ready for review but I still have some specs failing.
Started thinking about how to deal with MR 3. I will probably split the class into two subclasses (throttled and unthrottled versions). The actual class will select the proper subclass using the feature flag.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in maintainer review
MR 2: Add jids cache key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 1: Addressed last documentation feedback and back
MR 2: Addressed review feedback and back
MR 3: Implemented the worker changes and the service that will enqueue CleanupContainerRepositoryWorker in a controlled way using all the Application Limits defined.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in maintainer review
MR 2: Add jids cache key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 1: Addressed last documentation feedback and back
MR 2: The maintainer selected by Danger bot passed the ball to another maintainer, who is at capacity. Maintainer review will start next week.
MR 3: Finalized the big processing steps in the service handling the whole logic and started implementing tests. MR created in ~"workflow::In dev" status.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in maintainer ~"workflow::In review"
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 1: Addressed last documentation feedback and back
MR 2: Maintainer suggested using a simpler approach: !40393 (comment 404454972). Pinged the scalability team members for a discussion comparing the approach of the above analysis and the suggested simpler approach (that requires specific and custom handling in production nodes)
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in maintainer ~"workflow::In review"
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 2: Discussion has progressed (!40393 (comment 404454972)) and it seems to lean towards the original approach (the one depicted in the analysis above)
MR 3: Service done. I started creating the test suite and playing with the different limits.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 1: prepared some dummy container repositories on staging to test MR1 and its friends.
MR 2: It seems that we're clear to go forward with the original approach (!40393 (comment 406668001)). Fixed the rebase conflict and back to maintainer
MR 3: Service test examples implemented and fixing a few small bugs along the way. Now working on refactoring to parent worker so that depending on the feature flag the proper service is selected and executed.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 2: It has been asked whether we should merge the MR 2 changes into MR 3. Although the downside is that MR 3 will be a bigger MR, the benefit is that MR 3 will then give a vertical view of the feature. In short, everything is contained within MR 3.
MR 3: Switched gears on the approach. We are now marking container repositories as "partially" cleaned. This creates a backlog of marked objects that need to be processed. Using this "mark", the expiration policy worker can quickly know how much work there is and how to organize it into container repository cleanup jobs.
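A sketch of the marking idea, assuming a hypothetical status enum on ContainerRepository; the column name, states, and values are illustrative only:

```ruby
# Illustrative only: a status column lets the cron worker size its backlog with a
# cheap database query instead of asking the registry how much work is left.
class ContainerRepository < ApplicationRecord
  enum expiration_policy_cleanup_status: {
    cleanup_unscheduled: 0, # nothing pending
    cleanup_scheduled: 1,   # picked up by a policy, waiting for a worker
    cleanup_unfinished: 2,  # a previous run hit a limit and stopped early
    cleanup_ongoing: 3      # a worker is currently deleting tags
  }
end

# How much work is left, without touching the registry:
# ContainerRepository.cleanup_scheduled.or(ContainerRepository.cleanup_unfinished).count
```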
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
@dcroft MR 3 is a complex one (it was weighted rightfully as a 3). I'm already on my third refactoring on the approach. Once I put all the logs needed in the worker, I will need to test it properly locally and this will bump my confidence in the implementation accuracy.
In addition, we will need a proper testing on staging that will further increase the confidence.
I just noticed that we're missing the feature flag label although this change is behind a feature flag. Toggling it off will revert the behavior back to what we have currently. This provides an additional safety net.
I like this new approach, it's really simpler and cleaner.
Added all the necessary logs so that we can follow the worker activity in the logs (and later on Kibana). A few more cleanups and it will be ready for review.
I also need to test the whole system locally to make sure that it behaves properly, which will boost my confidence in the solution since, up to now, I've been working on each individual piece. It is time to test the whole machinery!
It appears to be clear that MR3 is taking way longer than expected.
On top of that we had
A suggested alternative on MR2 that took time to analyze. (it was not followed as it was not feasible)
We encountered an issue on MR3 where it was not clear how to mark the container repositories so that we have a clear vision on what work needs to be done. This point was missed by the above analysis and took time to deal with.
A suggested alternative on MR3 as a different team was tackling the exact same problem (a lot of work to do for workers that somehow they need to be limited to avoid resources exhaustion). In order to centralize the worker logic, a concern has been created. MR3 has now been modified to use it.
During my sync chats with the EM and PM, I warned that MR3 was indeed complex and was taking time.
Note that we couldn't break MR3 into smaller MRs to reduce its complexity. As a matter of fact, a maintainer even asked me to have all the logic together (as it is in MR3) so that reviewers can get the whole picture and can provide meaningful feedback.
During those sync chats, it was agreed that MR3 could slip out of %13.4. Due to how central these container repository cleanups are, it's better to take the time to have a proper and accurate implementation than to rush it and watch production burn.
Note that MR 3 was properly weighted as 3 and according to our weighting 3 can contain some surprises which was definitely the case here.
What to improve
We could have extracted MR3 into its own issue to allow a deeper investigation, but I'm not sure it would have been worth it as its complexity and surprises were only revealed once the implementation started.
I definitely could have better communicated that MR3 was slipping out of %13.4 in this issue so that the information was publicly available.
Action Item: I should keep an eye to be more in line with the transparency value
The approach based on Redis had a scalability issue that we might hit on gitlab.com. The approach was changed to rely on the database only. I implemented the overall approach and its corresponding tests. It looks as simple as the Redis one. A few more cleanups and changes are needed to make the MR ready for review. Being database-based, we will need the queries and their plans for a database review.
Reorganized the code in the MR and cleaned it up for its review.
Started testing it locally by introducing latency when deleting a single tag. Two bugs squashed.
First impressions: it's working as expected, repositories are marked according to whether they are waiting for a cleanup, have an ongoing cleanup, or have an unfinished cleanup. I need to further test this with other cases to improve my confidence that the implementation is accurate, but it's good to finally see all the gears working together.
I expect to finish my local tests tomorrow and push the MR to ~"workflow::In review"
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
Maintainer backend review revealed a flaw when the cron worker loops on all the executable policies. In simple words, this set is too big for the cron worker = its execution never ends.
We thus need first to fix this issue of "selecting too many policies". For that, we opened an MR to deal with #263110 (closed).
This will delay MR 3 a bit more, and since we still need to deploy and test it on staging, there is a risk of it slipping out of the current milestone.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In review"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
It is clear that MR 3 will not make it for 13.5. This is due to several flaws we discovered:
The background job selects way too many policies to execute. That's issue #263110 (closed). There is also this MR !44757 (merged) that will help to reduce the scope size.
The underlying service had an issue with user permissions when it was run by a cleanup policy execution. Under these conditions, it wouldn't do any tag deletion. That is fixed with !43359 (merged).
These flaws are quite deep and need to be resolved first before merging, deploying and enabling MR 3.
Maintainer reviews resumed. The database maintainer review revealed that one of the loops could hit the database statement timeout. This needs to be addressed and we already have a solution for it.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In review"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
There is a last discussion around the policy loop that the cron worker has to get through. The main issue is that the queries executing this loop are not great ~performance-wise, and as the loop grows (more policies to execute) we will hit an issue.
Note that #267546 (closed) has been opened to further discuss this and implement a solution. Right now, for MR 3, we're going to choose the best temporary solution, so that the MR can get merged and we can start working through the backlog of projects in need of a cleanup policy.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In review"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
This last discussion around the policies loop is giving us more surprises than expected. The database maintainer review revealed that the optimizer is having a hard time coming up with a good plan; it was suggested that I help it along.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In review"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
feature flag enabled on gitlab.com. We are monitoring the situation for a few days to properly adjust the capacity setting. Currently, we're using the conservative value of 2.
Found a bug that would prevent some logs from appearing in Kibana. MR opened to fix it: !46903 (merged)
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - merged
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged