When we turn on the policy feature for all projects, we need a way to prevent the Sidekiq queue from backing up and grinding things to a halt, especially when a policy has a large number of tags to clean up on its first run. To prevent thousands of tags from being queued by one project, we will add throttling to the queue or service.
Permissions and Security
There are no permissions changes required for this change.
Documentation
Add a section to the Container Registry admin docs explaining that jobs will be throttled and how to troubleshoot when something goes wrong.
Availability & Testing
What does success look like, and how can we measure that?
We are able to enable container expiration and retention policies for all projects on GitLab.com and self-managed instances regardless of which container registry is being used.
To me, at least, this is a very unknown area, so I am going to weight this as 3 for now. I think spending time working through the example MR in the description will help understand how big of a change this will be for this issue. @10io, @ggelatti, if you have any experience in this type of area, please feel free to revise the weight (and add any details).
We should use this issue to investigate and implement. If the initial investigation doesn't yield deliverable results, we can break it up for future milestones.
Sorry for the ping. I see it's: #196124 (closed) ; added this issue as a blocker. Please feel free to change it to "related" instead if that's not appropriate.
Started working on this. I poked people in Slack about what a good solution would be, and it seems it will be more complex than expected.
I started analyzing the current state and how we can avoid a very large and slow queue. I will report back when I have some results/questions.
Given the complexity of this part and how we can implement throttling, I'd like to lay out the current situation and how we can apply some Application Limits so that we don't overflow the queue with slow jobs.
Status
Deployed
Current Situation
```mermaid
graph TD
  cepw([ContainerExpirationPolicyWorker<br />Cron job running each 50 minutes<br />Loop on runnable policies])
  pcr[[API::ProjectContainerRepositories]]
  ceps[ContainerExpirationPolicyService<br />Schedule next run<br />Loop on container_repositories]
  ccrw([CleanupContainerRepositoryWorker])
  cts[CleanupTagsService<br />Validate params<br />Validate permissions<br />Apply filters on tags to delete]
  dts[DeleteTagsService<br />Loop on tags<br />Call #fast_delete or #slow_delete]
  prtc[[Projects::Registry::TagsController]]

  cepw-- container expiration policy -->ceps
  pcr-- container_repository_id, params -->ccrw
  ceps-- container_repository_id, params -->ccrw
  ccrw-- container_repository_id, params -->cts
  cts-- container_repository, tags -->dts
  prtc-- container_repository, tags -->dts
```
Here are my notes for each piece (leaving out less important details, such as the validations):
ContainerExpirationPolicyWorker
A cron job that is executed every 50 minutes. Its main job is to fetch all the runnable container expiration policies and execute them.
Note that the number of runnable container expiration policies is quite variable, as it depends on the policy's next_run_at attribute, which is set by the container expiration policy cadence.
ContainerExpirationPolicyService
A service that will do two main things on the given container expiration policy:
schedules the next run for the policy
loops on each associated container_repository and enqueues a CleanupContainerRepositoryWorker job with the params derived from the policy.
Note that the number of container_repositories is unbounded.
API::ProjectContainerRepositories
Similar to ContainerExpirationPolicyService, it will enqueue a CleanupContainerRepositoryWorker job with the (user-sourced) params.
There is a lease of 1h here. That means that for a given container_repository a user can do 1 request per hour max.
Projects::Registry::TagsController
That's the UI controller that drives the container registry pages. It has a bulk destroy action that will call the DeleteTagsService directly.
There is a strong limit: only 15 tags max can be selected for bulk destruction.
CleanupContainerRepositoryWorker
The central worker that will do the cleaning.
It's important to note that this worker doesn't know anything about how much work there is to do.
CleanupTagsService
This service does 3 main things:
check the params
check the user permissions
get the tags from the registry and apply the given filters on them to get a list of tags to delete.
Once there are tags to delete, they are passed to DeleteTagsService.
The list of tags from a given container_repository is unbounded.
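The filtering step is what decides how much work DeleteTagsService receives. As a rough, self-contained illustration (plain Ruby structs stand in for the real registry client and policy objects; tags_to_delete and the parameter names are made up for this sketch), the filters compose roughly like this:

```ruby
Tag = Struct.new(:name, :created_at, keyword_init: true)

def tags_to_delete(tags, name_regex:, keep_n:, older_than:)
  candidates = tags.select { |t| t.name.match?(name_regex) }     # keep only matching names
  candidates = candidates.sort_by(&:created_at).reverse          # newest first
  candidates = candidates.drop(keep_n)                           # retain the keep_n most recent
  candidates.select { |t| t.created_at < Time.now - older_than } # only tags old enough
end

tags = [
  Tag.new(name: 'v1.0',   created_at: Time.now - 200 * 86_400),
  Tag.new(name: 'v1.1',   created_at: Time.now - 10 * 86_400),
  Tag.new(name: 'latest', created_at: Time.now)
]

puts tags_to_delete(tags, name_regex: /\Av\d/, keep_n: 1, older_than: 90 * 86_400).map(&:name)
# => v1.0
```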
DeleteTagsService
This service deletes tags in two different ways depending on what type of container registry is running:
fast_delete
slow_delete
.com is running on the #fast_delete side as it is using the GitLab Container Registry.
The problem
The main issue is that CleanupContainerRepositoryWorker can be quick or slow to execute depending on the number of tags to delete. We can't know this in advance without contacting the container registry (a slow operation for now). In short, given a container_repository, we have no means of knowing in advance how many tags there are to delete.
Currently, container expiration policies are automatically created for new projects since 12.7. Projects created before 12.7 don't have a policy and the UI to create them is blocked. This block is behind an application setting: container_expiration_policies_enable_historic_entries.
By enabling container expiration policies on these pre-12.7 projects, we can potentially fill the queue with slow CleanupContainerRepositoryWorker jobs (eg. jobs with many, many tags to delete) and block the whole queue. That is why this issue has been created: to introduce Application Limits in the form of throttling so that this doesn't happen.
Rough estimates for gitlab.com:
We have ~720K container_repositories not connected to a container expiration policy.
Possible solution
First of all, I'd like to point out that the MR referenced in this issue description (gitlab-foss!7292 (merged)) has been reverted (gitlab-foss#51509 (closed)) due to errors and wrong behaviors. In short, there is no "automatic" / implicit throttling support for jobs at the moment and it has to be done on a case-by-case basis.
Second, since we're going to update / modify workers, we should be aware that those workers have a strong execution time limit: 300s.
As for the solution, we can leverage one fact: container expiration policies are executed more than once, so we don't need to delete all tags at once. The idea is to have a set of limits for jobs and, when one of them is hit, stop the current tag destruction.
Being the last piece interacting with the container registry, DeleteTagsService is one of the few components that knows how much work there is to do. It has all the necessary knowledge to impose a limit. We could even use different limits for the two registry types, but for the first iteration we're going to focus on #fast_delete.
When hitting the limit, this service will simply not consider any further tags. This limit should be implemented as a maximum execution time, kept below the 300s worker limit.
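A minimal sketch of what such a time budget could look like, assuming a delete_tag callable that issues one registry request per tag. The 100s default mirrors the delete_tags_service_timeout suggested below; the method and return shape are placeholders, not the actual service:

```ruby
# Sketch only: stop considering further tags once the time budget is spent.
def fast_delete(tags, timeout: 100, &delete_tag)
  start_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  deleted = []

  tags.each do |tag|
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start_time
    break if elapsed > timeout # limit hit: leave the remaining tags for the next run

    deleted << tag if delete_tag.call(tag) # one registry request per tag
  end

  { status: deleted.size == tags.size ? :success : :partial, deleted: deleted }
end

# fast_delete(%w[v1 v2 v3], timeout: 100) { |tag| true }
# => { status: :success, deleted: ["v1", "v2", "v3"] }
```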
Refactor ContainerExpirationPolicyWorker
This worker can orchestrate the underlying worker (CleanupContainerRepositoryWorker) and throttle how much of these are enqueued.
To orchestrate this, this worker will use different limits, in particular it will work with a number of "slots" available for CleanupContainerRepositoryWorker jobs. The overall logic will be:
Check if any CleanupContainerRepositoryWorker jobs are still running from a previous enqueue.
Compute how many CleanupContainerRepositoryWorker jobs should be enqueued.
Enqueue them in a reasonable way.
Re-enqueue itself for execution in the near future.
We basically have a loop that checks the ongoing CleanupContainerRepositoryWorker work and feeds more work as slots become available. Since having a long-running job is not possible, we simply re-enqueue the job to execute again in the near future. This is essentially a pause of X seconds between loop iterations.
Technical details
The ContainerExpirationPolicyWorker will follow this code flow:
ContainerExpirationPolicyWorker takes two arguments:
started_at
cache_key
When enqueued by cron, these two arguments will be empty/not used
When self enqueued, they will be filled.
The execution goes like this:
If started_at is given, check that we're not above max_run_time. If we are, return.
Compute the slots_available. Use the length of the cache_key array as it's giving the ongoing/pending jobs.
Take max_slots or slots_available container_repositories on which the policy should be executed. We will probably need to order things here using the next_run_at of the policy. This is to ensure that we deal with policies that have the biggest [Time.zone.now - policy.next_run_at] difference first.
In slices of batch_count, enqueue CleanupContainerRepositoryWorker jobs spreading the load using batch_backoff_period. Using the loop as shown in gitlab-com/gl-infra/scalability#461 (comment 372001300). Take note of the job ids in job_ids.
If cache_key was not given, create a new one with the job_ids array.
Re-enqueue ContainerExpirationPolicyWorker to be executed in backoff_period with started_at and cache_key.
In addition, CleanupContainerRepositoryWorker will need to be updated to:
Accept an optional cache_key
When ending or failing, if cache_key was given, remove the job_id from it.
Use a lease on the container_repository.id to ensure that two sidekiq threads will not clean the same container_repository.id at the same time.
If possible, try to generalize the whole logic into abstract classes so that it can be re-used easily elsewhere. A rough Ruby sketch of the throttled worker follows.
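The sketch below follows the flow above, assuming plain Sidekiq. The constants mirror the suggested limits in the next section, and the private helpers are stubs standing in for the Redis/job-tracking state and the policy queries described above; none of this is the final implementation:

```ruby
require 'sidekiq'

class ThrottledContainerExpirationPolicyWorker
  include Sidekiq::Worker

  MAX_SLOTS            = 100     # max CleanupContainerRepositoryWorker jobs in flight
  BATCH_COUNT          = 10      # jobs scheduled for the same time slice
  BATCH_BACKOFF_PERIOD = 25      # seconds between batches
  BACKOFF_PERIOD       = 25      # seconds before this worker re-enqueues itself
  MAX_REENQUEUE_TIME   = 30 * 60 # stop re-enqueueing after 30 minutes

  def perform(started_at = nil, cache_key = nil)
    started_at ||= Time.now.to_i
    return if Time.now.to_i - started_at > MAX_REENQUEUE_TIME

    slots = MAX_SLOTS - pending_job_ids(cache_key).size # ongoing jobs occupy slots

    job_ids = repositories_to_clean(limit: slots).each_slice(BATCH_COUNT).with_index.flat_map do |batch, index|
      batch.map do |repository_id|
        # spread the batches over time instead of enqueueing everything at once
        CleanupContainerRepositoryWorker.perform_in(index * BATCH_BACKOFF_PERIOD, repository_id, cache_key)
      end
    end

    track_job_ids(cache_key, job_ids)
    self.class.perform_in(BACKOFF_PERIOD, started_at, cache_key) # self re-enqueue
  end

  private

  # Stubs: the real implementation would read Redis (job ids) and the database (policies).
  def pending_job_ids(_cache_key) = []
  def repositories_to_clean(limit:) = []
  def track_job_ids(_cache_key, _job_ids); end
end
```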
Suggested limits
Names are not final.
| Parameter | Suggested starting value | Explanation |
| --- | --- | --- |
| max_slots | 100 | The max number of CleanupContainerRepositoryWorker jobs enqueued by ContainerExpirationPolicyWorker at any given time. |
| batch_count | 10 | The number of CleanupContainerRepositoryWorker jobs scheduled for the same timestamp or time slice. |
| batch_backoff_period | 25s | Batches of CleanupContainerRepositoryWorker jobs will be scheduled for times separated by this wait period. |
| backoff_period | 25s | The wait period before ContainerExpirationPolicyWorker re-enqueues itself. |
| max_reenqueue_time | 30min | The period during which ContainerExpirationPolicyWorker will re-enqueue itself. |
| delete_tags_service_timeout | 100s | The max run time for DeleteTagsService#execute when using DeleteTagsService#fast_delete. |
These limits will be implemented as Application settings so that we can easily tweak them as we go along. They could be exposed in the UI.
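For illustration, a hedged sketch of how a couple of these could land as application settings (the column names, defaults, and migration version are illustrative only, not the final schema), so the values can be tweaked at runtime:

```ruby
# Illustrative migration only; column names and defaults are not final.
class AddContainerExpirationPolicyLimits < ActiveRecord::Migration[6.0]
  def change
    add_column :application_settings, :container_registry_expiration_policies_max_slots,
               :integer, default: 100, null: false
    add_column :application_settings, :container_registry_delete_tags_service_timeout,
               :integer, default: 100, null: false # seconds
  end
end
```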
Benefits
Both workers are running within the 300s limit.
The load on sidekiq is spread and as a side effect, the load on the container registry will also be spread.
The number of CleanupContainerRepositoryWorker enqueued by ContainerExpirationPolicyWorker is limited at any given time.
We enforce deduplication on CleanupContainerRepositoryWorker jobs, meaning that at any given time we will not have two enqueued jobs with the exact same params
By using execution time limits, both workers are independent of the throughput of the Container Registry on those delete requests. If the Container Registry has a temporary slowdown on API requests, it will affect the throughput of DeleteTagsService (tags_deleted/min), but the limits will still be properly applied.
Deployment
We want to deploy this feature in small steps. To do so, we're going to rely on several feature flags and application settings to control how both workers behave (a sketch of the flag gating follows the list):
A first feature flag that would be the behavior change on the ContainerExpirationPolicyWorker. This feature flag would allow us to switch between: no throttling (current implementation) <-> throttling (the implementation detailed above)
The limits used during the throttling mode have to be in the Application Settings so that we can fine tune them quickly without a code deployment (eg. no hard coded limits)
A second feature flag that would be one that is scoped by project and that allows creating a container expiration policy on a project that is older than 12.7. This way, we can selectively allow which pre 12.7 projects will have a container expiration policy and move forward with smaller steps (eg. including more and more projects while keeping an eye on logs).
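As a sketch of the gating, assuming GitLab's Feature.enabled? helper; the flag names are placeholders and the two perform_* methods are stubs for the behaviors described above:

```ruby
require 'sidekiq'

class ContainerExpirationPolicyWorker
  include Sidekiq::Worker

  def perform
    if Feature.enabled?(:container_expiration_policies_throttling) # placeholder flag name
      perform_throttled   # new orchestration with limits (see the sketch above)
    else
      perform_unthrottled # current implementation, kept as a safety net
    end
  end

  private

  def perform_throttled; end   # stub
  def perform_unthrottled; end # stub
end

# The second, project-scoped flag would gate policy creation on pre-12.7 projects:
# Feature.enabled?(:container_expiration_policies_historic_entry, project)
```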
Side effects
By adding a limit to DeleteTagsService, this service can do partial work: part of the tags are deleted, but not all of them.
This partial work can be seen/displayed in the UI. Example: a user has a project with a single container_repository that has 10000 tags. They want to remove all of them except the 5 most recent ones. They call the bulk_delete API but, due to the limit (say 1000), around 9000 tags remain and are displayed in the UI.
We can mitigate this situation by documenting this limit properly.
The same will happen when enabling all expiration policies. We can imagine that the first runs of ContainerExpirationPolicyWorker will be packed with jobs to do = we will not be able to execute all of them within the max_run_time = It is guaranteed that some container_repositories linked to runnable policies will stay unprocessed even though the policy is executed.
From https://log.gprd.gitlab.net/goto/1c1b64d8c97b6fb950d56f45f95bddda (internal link), we can see that Projects::Registry::TagsController#destroy execution time is around 0.5s for 50th percentile. We can use that for a baseline for DeleteTagsService#fast_delete execution time although it should be below that in reality.
Note that I didn't take into account the API::ProjectContainerRepositories access. This one is under a lease. I think we can focus solely on the ContainerExpirationPolicyWorker, as this is the one that will generate the bulk of CleanupContainerRepositoryWorker jobs.
MRs 1, 2 and 3 are the absolute minimal implementation needed for this change. One note on the weight of MR 3: 3 is a pessimistic view of the work that has to be done. Our discussions with the scalability team really helped to pinpoint the technical details of this MR, but we can still get a major surprise.
@trizzi What is the deployment plan of the application setting container_expiration_policies_enable_historic_entries? In other words, how do we enable container expiration policies globally? I see two ways:
We switch the application setting on and let users create container expiration policies for their pre 12.7 projects
We switch the application setting on and we seed a default container expiration policy on all these pre 12.7 projects
@jdrpereira For DeleteTagsService#fast_delete, could we imagine that DELETE /v2/#{name}/tags/references takes a list of tags and delete them all in one single API request? We could limit the number of tags passed but the idea is that for n tags, we would call the api n/max_tags_count times.
Thanks
To @reprazent, @tkuah and @brodock for their help while brainstorming on a solution for this issue.
The side effects seem reasonable to me in order to enable cleaning up so many tags that are currently just sitting there.
I think it'll be important to inform the users as best we can in the UI when incomplete deletions exist. Based on our Slack conversation, I put together two rough sketches of a way we might do that:
Image Repository List View Notification
This design includes an icon w/ tooltip stating the image repository has an incomplete tag deletion job
This comp includes a high-level alert that tells the user more information and includes a link to documentation
In an ideal world, we would be able to state which tags specifically are scheduled for deletion. Given that this isn't technically feasible right now, a more general alert would inform the user.
For DeleteTagsService#fast_delete, could we imagine that DELETE /v2/#{name}/tags/references takes a list of tags and delete them all in one single API request? We could limit the number of tags passed but the idea is that for n tags, we would call the api n/max_tags_count times.
@10io that's technically possible and certainly a welcomed improvement, but it would require a new route. Given that we decided to temporarily freeze the development of new Container Registry features, I would leave that for a future iteration, unless it becomes a real issue/bottleneck, in which case we can reevaluate. We have to continue supporting individual tag deletion for third-party registries for now, so this seems feasible regardless.
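For reference, the batching idea from the question amounts to something like this; delete_tags_request is hypothetical (no such bulk route exists today), and the cap value is just an example:

```ruby
# For n tags and a per-request cap, this issues ceil(n / max_tags_count) requests.
def delete_in_batches(tags, max_tags_count: 100, &delete_tags_request)
  tags.each_slice(max_tags_count) do |batch|
    delete_tags_request.call(batch) # one hypothetical bulk API request per batch
  end
end

# delete_in_batches((1..250).map { |i| "tag-#{i}" }) { |batch| puts batch.size }
# => 100, 100, 50 (three requests)
```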
Another great write-up @10io! Just to make sure I understand clearly..
The tradeoff to adding hard limits will be that it may take a long time for an expiration policy to finish running and we'll handle that by notifying them in the UI with an icon and by adding a note about this in the documentation.
Separately, we can't simply turn this on for all projects. We'll have to roll this out in some logical order.
I'm OK with this plan. @icamacho one thing I'd like to avoid is having a warning that the user can not clear. But, the comps look good to me.
What is the deployment plan of the application setting
There are two parts to this. First, how do we roll the feature out in a scalable manner? Second, should we seed a default cleanup policy for each project as we roll it out?
For the former, I know @jdrpereira mentioned rolling it out based on the date of the project's creation. What do you think about rolling the feature out by namespace? This would allow us to choose a few alpha/beta customers to test with and monitor the results. We could start with one company, then 5, then 10, etc., until we feel that we are ready for a broader rollout. (Maybe we could even combine approaches)
I'm curious, I know you mentioned there is no standard GitLab way for throttling, but have other teams rolled out features in some logical order? @joshlambert did we phase the Puma rollout?
With regards to seeding a default cleanup policy.. I think we should do this. Right now the default settings are fairly conservative:
The expiration interval is set to 90 days
The expiration schedule is set to run weekly
The number of tags to retain is set to 10.
All tags matching the above criteria are set for expiration
The tradeoff to adding hard limits will be that it may take a long time for an expiration policy to finish running and we'll handle that by notifying them in the UI with an icon and by adding a note about this in the documentation.
It's more that when an expiration policy takes a long time to run, we will need to stop enqueuing jobs and the work will resume the next time the container expiration policy is executed. For example, let's say that, after evaluation, a policy has to delete 30K tags across different container_repositories and we have a 30min limit on the runtime. We let the workers do their job and after 30min they have deleted 25K. We will not enqueue any more jobs and there will be 5K tags left to delete.
As such, a container_repository can be in one of these states after its policy has run:
No tags were deleted. This can happen when there is so much work at once that some policies will not enqueue any delete job.
Part of the tags were deleted. That's the example above. This is where the frontend would show an indication that Hey, due to volume, the container expiration policy was partially executed.
All the tags were deleted <- that's the ideal, perfect case.
Given the amount of work we will receive when enabling the policies globally, we will encounter (1.) and (2.)
For the former, I know @jdrpereira mentioned rolling it out based on the date of the project's creation. What do you think about rolling the feature out by namespace? This would allow us to choose a few alpha/beta customers to test with and monitor the results. We could start with one company, then 5, then 10, etc., until we feel that we are ready for a broader rollout. (Maybe we could even combine approaches)
That can work too. Given the numbers I tried to get, we're in for a massive amount of jobs. As such, we should try to distribute this "load" across different weeks / milestones. Since this aspect is more about rolling out the policies globally, can we move this discussion to #196124 (closed)?
With regards to seeding a default cleanup policy.. I think we should do this. Right now the default settings are fairly conservative:
I agree, but from @jdrpereira's comment (#196124 (comment 363566582)), some projects might not want to have a container expiration policy enabled. Again, that's more of a rollout aspect.
I'm curious, I know you mentioned there is no standard GitLab way for throttling, but have other teams rolled out features in some logical order?
By throttling, I'm talking about the worker queue management. From what I gather, there is nothing in sidekiq or in GitLab to handle it directly. The thing that is the closest is a scheduler worker = a parent worker that will enqueue jobs for children workers and keep an eye on them so that the number of jobs enqueued is under control at all times.
I'm OK with this plan. @icamacho one thing I'd like to avoid is having a warning that the user can not clear. But, the comps look good to me.
@trizzi I also want to avoid annoying messages users can't get rid of. If we allow a user to clear the message saying that deletions are in progress, we currently don't have a way to inform them that the status of their registry has changed (i.e. the deletion has completed).
If we let users dismiss the alert, it might be beneficial to add a green success alert saying that the pending deletion job has completed. The user would also be able to dismiss this alert.
Given the scale of time @10io has discussed around deleting, I think it is important that we inform the user that the image repository has changed state from "mid delete process" to "completed delete process". However, I might be going overboard. Any thoughts around this?
This did not work for Global Search since we needed extra logic to kick off backfills at the same time as enabling the feature so we implemented a data model called ElasticsearchIndexedNamespace to track the namespaces that are enabled and implemented ElasticNamespaceRolloutWorker which is invoked by an API to rollout to percentages of groups. Our stuff is quite custom based on the needs of the Elasticsearch architecture so I'm not sure how well it generalises to your problem.
It's worth noting that for us the concept of "percentage enabled" is only a facade and really we need to persist the exact groups that have been enabled since percentages are relative to the overall data set of groups which is a moving target and since the data backfill has to be up to date for the group to use Elasticsearch we need to persist the groups explicitly.
@tkuah I think, from discussions with @trizzi, we want to move in small steps. The best way for us, given the amount of work to do for workers, would be to selectively include projects to have container expiration policies.
This way, we have a better control on how many "new" policies are created and used by the cleanup workers. We already have some customers interested in trying these policies, these could be very well our first projects to be included to have a cleanup policy.
At a later time, we can slowly "open the gates" by using a %-based feature flag.
@tkuah I'd like to make crystal clear two terms: container_repository and tag. Since the analysis above is deeply technical, I'm re-using the terms I see in the codebase.
Having said that here is which term is what on the container registry pages. When you open the container registry page, you get a list of images:
If you click on an image, you get its tags:
So basically: a Project has x container_repositories and a container_repository has y tags.
Now, these two objects are not persisted in the same way:
container_repository: persisted in the database and known by rails without accessing the container registry.
tag: not persisted in the database and thus not known by rails without accessing the container registry.
In short, on the rails side we can know x but not y. To know y, rails has to contact the container registry.
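To make the contrast concrete, a quick Rails-console style illustration (the project path is hypothetical; this assumes the GitLab codebase's own associations rather than standalone code): counting repositories is a plain database query, while listing tags triggers a registry round trip.

```ruby
project = Project.find_by_full_path('some-group/some-project') # hypothetical path

project.container_repositories.count      # x: answered by the database alone
project.container_repositories.first.tags # y: fetched live from the container registry
```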
Now that the two terms are defined, let's dig into the numbers. Following a discussion with @nmezzopera, projects from before 12.7 currently don't have a container expiration policy. When we enable the policies globally, all these projects will potentially have a freshly created policy that needs to be executed.
I ran a rough count (this time, with the right conditions) and we have ~420K container_repositories that don't have an expiration policy (and thus are connected to a project prior 12.7)
Do we have an idea of how many expired images are under the estimated 100K container_repositories with a disabled policy?
I think your question is how many tags we have under those 420K container_repositories, and the answer is: we can't know in advance. As stated above, we don't persist the tags on the rails side; only the container registry is able to know this number. I'm not sure if there is a way to get an estimate from the container registry.
@jdrpereira / @hswimelar Do we have some stats for .com on container registry tags that we could take here into account? Such as: the average tags count per repository, the highest tags count?
Do we have some stats for .com on container registry tags that we could take here into account? Such as: the average tags count per repository, the highest tags count?
Unfortunately no. We extracted similar statistics for the dev registry in the past (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9033#note_295613525), which can give us a rough idea, but it's not feasible to do the same analysis for the production registry due to technical limitations (because of its size and the time it would take to scan).
I think your question is how many tags we have under those 420K container_repositories, and the answer is: we can't know in advance. As stated above, we don't persist the tags on the rails side; only the container registry is able to know this number. I'm not sure if there is a way to get an estimate from the container registry.
Even if we knew how many tags exist under those repositories, that would only serve as a highly pessimistic estimate (better than nothing though), as the expiration policies rely on user defined regex to tell which tags are to be deleted.
Unfortunately no. We extracted similar statistics for the dev registry in the past (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9033#note_295613525), which can give us a rough idea, but it's not feasible to do the same analysis for the production registry due to technical limitations (because of its size and the time it would take to scan).
Wow, am I reading this correctly? Almost 10K tags for the highest count ?
Perhaps the takeaway here is that the tags count distribution goes more towards a (relatively) low tags count. This helps us to pinpoint a limit for DeleteTagsService
Even if we knew how many tags exist under those repositories, that would only serve as a highly pessimistic estimate (better than nothing though), as the expiration policies rely on user defined regex to tell which tags are to be deleted.
Yes, I guess we have no choice other than looking at the worst case scenarios. In this case, I'm trying to base my view on the default policy: no regex, keep_n 10, older_than 90d and cadence 1d. That is not super realistic but we have to start somewhere right?
I ran a rough count (this time, with the right conditions) and we have ~420K container_repositories that don't have an expiration policy (and thus are connected to a project prior 12.7)
That's a lot. I'm a bit worried about it, as we would be opening the doors to all those potentially "dirty" repositories (with a high number of tags to be expired) simultaneously. This would put the registry (and the GitLab workers) under an extreme load.
We might need to consider approaching the rollout of the expiration policy for all gitlab.com projects in batches (e.g. for all projects since 12.5, then all projects since 12.3, etc., until there is nothing left). I'll take this discussion to #196124 (closed).
Wow, am I reading this correctly? Almost 10K tags for the highest count ?
Indeed. We can count with some odd distributions in production as well.
Yes, I guess we have no choice other than looking at the worst case scenarios. In this case, I'm trying to base my view on the default policy: no regex, keep_n 10, older_than 90d and cadence 1d. That is not super realistic but we have to start somewhere right?
Yes, I don't think we can do much better. I think it would be really useful to improve logging for better observability around the expiration policies. If we opt for a phased rollout, we can collect statistics (e.g. batch count, tag count per batch, batch processing time) between each phase and adapt the approach if/as needed.
That's a lot. I'm a bit worried about it, as we would be opening the doors to all those potentially "dirty" repositories (with a high number of tags to be expired) simultaneously. This would put the registry (and the GitLab workers) under an extreme load.
That's why we have to put some Application Limits in place to control the load. By having a root worker orchestrating the underlying workers, we can fine control how many workers can "hammer" the registry and for how much time.
I still do think that DeleteTagsService should have a proper limit to ensure that we don't overload the registry with requests.
We might need to consider approaching the rollout of the expiration policy for all gitlab.com projects in batches (e.g. for all projects since 12.5, then all projects since 12.3, etc., until there is nothing left). I'll take this discussion to #196124 (closed).
What's curious is that the page loads tag counts pretty quickly, certainly faster than 0.5 seconds per tag, which would add up to just shy of 14 hours.
Isn't the list loaded in one request to the registry and the rails backend just gets the size of it?
If that's the case, there is room for improvement right there: have an API endpoint just returning the count instead of having the backend getting the whole list. Well, thoughts for future iterations I guess
Isn't the list loaded in one request to the registry and the rails backend just gets the size of it?
Yes.
If that's the case, there is room for improvement right there: have an API endpoint just returning the count instead of having the backend getting the whole list. Well, thoughts for future iterations I guess
As of now (filesystem-based metadata), counting tags or listing them takes roughly the same time in the container registry side (it requires a "list" operation on the repository "folder" regardless). Therefore, the only benefit would be to decrease the response payload, which would lead to negligible gains.
Once we have a metadata database this will be different (one more feature in the queue).
From https://log.gprd.gitlab.net/goto/1c1b64d8c97b6fb950d56f45f95bddda (internal link), we can see that Projects::Registry::TagsController#destroy execution time is around 0.5s for 50th percentile. We can use that for a baseline for DeleteTagsService#fast_delete execution time although it should be below that in reality.
we put these parameters in the application settings (to have a way to modify them at runtime)
max_tags_count for DeleteTagsService
max_run_time for ContainerExpirationPolicyWorker
max_capacity for ContainerExpirationPolicyWorker
we deploy it and monitor the situation with the logs (see "Other considerations") from workers + the load on the container registry.
once the situation is stable, we move to the next step in the plan for #196124 (closed)
Before starting the implementation, we should state what values we will use for those parameters. In #208193 (comment 363960754), I suggested some values.
These limits will also be applied to self-managed instances; is that an issue? One thing to note here: self-managed instances can be using the #slow_delete method, which is slower than #fast_delete.
As I said above in "Future iterations", we might want to expose the parameters in the UI so that admins can configure them depending on their architecture. Perhaps we should have this right from the start.
We definitely need better logging and metrics around the execution of the cleanup policies. Currently, it's very difficult (if not impossible in some cases) to debug issues. It's better to deal with that before opening the doors to potentially many more of them (even if most are false alarms, there is a limit for what we can triage/debug).
Regarding the phased rollout, I like what Tim proposed, doing it by namespace instead of creation date. We'll likely want to have more control over which repositories can be candidates for cleanup on the first stages.
Additionally, we can also consider starting by limiting the execution for repositories that have less than a specific number of tags to be deleted. Such might sound counter-productive, as we would be avoiding the biggest repositories (which are the ones we want to clear the most). Still, if we can get all the small/medium ones cleaned up for the first time (and more importantly, keep cleaning them regularly and much quicker after that), we'll then have headroom and confidence to tackle the biggest ones.
Regarding self-managed instances vs gitlab.com, AFAIK, there is no hard limit on which repositories can be cleaned for self-managed instances (i.e. the policy can be enabled on projects created before 12.8). Is that right @trizzi? If so, we should probably focus on gitlab.com and the self-managed instances using the GitLab Container Registry for this and therefore not worry about the "slow delete" mode (which makes our problems 8 times worse, as it's 8 times slower).
We could use a feature flag scoped by project, which enables us to apply the above behavior (scheduler worker) only on selected projects.
It's a bit more work upfront as we have to keep the current ContainerExpirationPolicyWorker implementation and have a new one at the same time (the feature flag will choose the correct one). I still think it's worth it: it allows us to take small steps with this in production which, given the amount of tags to delete, is not a bad thing at all.
After some thoughts, we can't use a single feature flag to gate this feature. The main reason is that the ContainerExpirationPolicyWorker needs to be changed for all policies at once. We can't have a mode for certain policies and a different mode for other policies. Thus, I suggest having several knobs/switches to control how this worker is handling things:
A first feature flag that would be the behavior change on the ContainerExpirationPolicyWorker. This feature flag would allow us to switch between: no throttling (current implementation) <-> throttling (applying some limits and have it long running so that it can monitor what's happening)
The limits used during the throttling mode have to be in the Application Settings so that we can fine tune them quickly without a code deployment (eg. no hard coded limits)
A second feature flag that would be one that is scoped by project and that allows creating a container expiration policy on a project that is older than 12.7. This way, we can selectively allow which pre 12.7 projects will have a container expiration policy and move forward with smaller steps (eg. including more and more projects while keeping an eye on logs).
By having two feature flags we can "revert" to the current implementation at will. I think we can start with the numbers stated in the scenario in #208193 (comment 363960754) and start observing how the system reacts.
As there is no change in the permissions and CleanupTagsService validates the permissions, I'm marking this as reviewed with no-action required from appsec.
Refactored the analysis comment to add a deployment section. Pinged the scalability team on gitlab-com/gl-infra/scalability#461 and asked about having a long running worker. Current SLO for ContainerExpirationPolicyWorker is 300sec / 5min. We're not going very far with that
Started working on limiting the execution time for DeleteTagsService.
We're having an interesting discussion in the scalability issue about the approach for the ContainerExpirationPolicy worker with a good alternative to a scheduler worker: gitlab-com/gl-infra/scalability#461 (comment 372001300).
MR
!35539 (merged) - Add logging to DeleteTagsService - in review
I still don't get why this limitation is enabled for on-premise installations. I think each on-premise installation administrator must be able to decide on that.
@glensc Self-managed instances are able to turn the feature on for all projects by updating their app's settings.
If you set container_expiration_policies_enable_historic_entries to true, existing projects will have the ability to have policies set. https://docs.gitlab.com/ee/api/settings.html
Updated the analysis above with the last discussion results.
Opened MR 1 Add a execution time limit to DeleteTagsService. While implementing this, I realized that it would be beneficial to better organize the code around DeleteTagsService. I opened a second MR for just that.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in dev
!36337 (merged) - Splitting DeleteTagsService in smaller services - in dev
Added more rspec examples for MR1 and prepared the MR for the review. I expect to be ready for review tomorrow.
I investigated how many times CleanupContainerRepositoryWorker has a duration > 300s (this will trigger an alert for the production team). Those cases would be partially executed if MR1 was merged. Here is the corresponding dashboard:
MR 2 ready for review but I still have some specs failing.
Started thinking about how to deal with MR 3. I will probably split the class into two subclasses (throttled and unthrottled versions). The actual class will select the proper subclass using the feature flag.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in maintainer review
MR 2: Add jids cache key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 1: Addressed last documentation feedback and back
MR 2: Addressed review feedback and back
MR 3: Implemented the worker changes and the service that will enqueue CleanupContainerRepositoryWorker in a controlled way using all the Application Limits defined.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in maintainer review
MR 2: Add jids cache key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 1: Addressed last documentation feedback and back
MR 2: The maintainer selected by Danger bot passed the ball to another maintainer, who is at capacity. Maintainer review will start next week.
MR 3: Finalized the big processing steps in the service handling the whole logic and started implementing tests. MR created in ~"workflow::In dev" status.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in maintainer ~"workflow::In review"
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 1: Addressed last documentation feedback and back
MR 2: Maintainer suggested using a simpler approach: !40393 (comment 404454972). Pinged the scalability team members for a discussion comparing the approach of the above analysis and the suggested simpler approach (that requires specific and custom handling in production nodes)
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - in maintainer ~"workflow::In review"
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 2: Discussion has progressed (!40393 (comment 404454972)) and it seems to lean towards the original approach (the one depicted in the analysis above)
MR 3: Service done. I started creating the test suite and playing with the different limits.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 1: prepared some dummy container repositories on staging to test MR1 and its friends.
MR 2: It seems that we're clear to go forward with the original approach (!40393 (comment 406668001)). Fixed the rebase conflict and back to maintainer
MR 3: Service test examples implemented and fixing a few small bugs along the way. Now working on refactoring to parent worker so that depending on the feature flag the proper service is selected and executed.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
MR 2: It has been asked whether we should merge the MR 2 changes into MR 3. Although the downside is that MR 3 will be a bigger MR, the benefit is that MR 3 will then give a vertical view of the feature. In short, everything is contained within MR 3.
MR 3: Switched gears on the approach. We are now marking container repositories as "partially" cleaned. This creates a backlog of marked objects that need to be processed. Using this "mark", the expiration policy worker can quickly know how much work there is and how to organize it into container repository cleanup jobs.
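A sketch of the marking idea, assuming a hypothetical status enum on ContainerRepository; the column name, states, and values are illustrative only:

```ruby
# Illustrative only: a status column lets the cron worker size its backlog with a
# cheap database query instead of asking the registry how much work is left.
class ContainerRepository < ApplicationRecord
  enum expiration_policy_cleanup_status: {
    cleanup_unscheduled: 0, # nothing pending
    cleanup_scheduled: 1,   # picked up by a policy, waiting for a worker
    cleanup_unfinished: 2,  # a previous run hit a limit and stopped early
    cleanup_ongoing: 3      # a worker is currently deleting tags
  }
end

# How much work is left, without touching the registry:
# ContainerRepository.cleanup_scheduled.or(ContainerRepository.cleanup_unfinished).count
```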
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - ~"workflow::In review"
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
@dcroft MR 3 is a complex one (it was weighted rightfully as a 3). I'm already on my third refactoring on the approach. Once I put all the logs needed in the worker, I will need to test it properly locally and this will bump my confidence in the implementation accuracy.
In addition, we will need a proper testing on staging that will further increase the confidence.
I just noticed that we're missing the feature flag label although this change is behind a feature flag. Toggling it off will revert the behavior back to what we have currently. This provides an additional safety net.
I like this new approach, it's really simpler and cleaner.
Added all the necessary logs so that we can follow the worker activity in the logs (and later on Kibana). A few more cleanups and it will be ready for review.
I also need to test the whole system locally to make sure that it behaves properly, which will boost my confidence in the solution since, up to now, I've been working on each individual piece. It is time to test the whole machinery!
It appears to be clear that MR3 is taking way longer than expected.
On top of that we had
A suggested alternative on MR2 that took time to analyze. (it was not followed as it was not feasible)
We encountered an issue on MR3 where it was not clear how to mark the container repositories so that we have a clear vision on what work needs to be done. This point was missed by the above analysis and took time to deal with.
A suggested alternative on MR3 as a different team was tackling the exact same problem (a lot of work to do for workers that somehow they need to be limited to avoid resources exhaustion). In order to centralize the worker logic, a concern has been created. MR3 has now been modified to use it.
During my sync chats with the EM and PM, I warned that MR3 was indeed complex and was taking time.
Note that we couldn't break MR3 into smaller MRs to reduce its complexity. As a matter of fact, a maintainer even asked me to have all the logic together (as it is in MR3) so that reviewers can get the whole picture and can provide meaningful feedback.
During those sync chats, it was agreed that MR3 could slip out of %13.4. Due to how central these container repository cleanups are, it's better to take the time to have a proper and accurate implementation than to rush it and watch production burn.
Note that MR 3 was properly weighted as 3 and according to our weighting 3 can contain some surprises which was definitely the case here.
What to improve
We could have extracted MR3 into its own issue to allow a deeper investigation, but I'm not sure it would have been worth it as its complexity and surprises were only revealed once the implementation started.
I definitely could have better communicated that MR3 was slipping out of %13.4 in this issue so that the information was publicly available.
Action Item: I should keep an eye to be more in line with the transparency value
The approach based on Redis had a scalability issue that we might hit on gitlab.com. The approach was changed to rely on the database only. I implemented the overall approach and its corresponding tests. It looks as simple as the Redis one. A few more cleanups and changes are needed to make the MR ready for review. Being database-based, we will need the queries and their plans for a database review.
Reorganized the code in the MR and cleaned it up for its review.
Started testing it locally by introducing latency when deleting a single tag. Two bugs squashed.
First impressions: it's working as expected, repositories are marked according to whether they are waiting for a cleanup, have an ongoing cleanup, or have an unfinished cleanup. I need to further test this with other cases to improve my confidence that the implementation is accurate, but it's good to finally see all the gears working together.
I expect to finish my local tests tomorrow and push the MR to ~"workflow::In review"
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In dev"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
Maintainer backend review revealed a flaw when the cron worker loops on all the executable policies. In simple words, this set is too big for the cron worker = its execution never ends.
We thus need first to fix this issue of "selecting too many policies". For that, we opened an MR to deal with #263110 (closed).
This will delay MR 3 a bit more, and since we still need to deploy and test it on staging, there is a risk of it slipping out of the current milestone.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In review"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
It is clear that MR 3 will not make it for 13.5. This is due to several flaws we discovered:
The background job selects way too many policies to execute. That's issue #263110 (closed). There is also this MR !44757 (merged) that will help to reduce the scope size.
The underlying service had an issue with user permissions when it was run by a cleanup policy execution. Under these conditions, it wouldn't do any tag deletion. That is fixed with !43359 (merged).
These flaws are quite deep and need to be resolved first before merging, deploying and enabling MR 3.
Maintainer reviews resumed. The database maintainer review revealed that one of the loops could hit the database statement timeout. This needs to be addressed and we already have a solution for it.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In review"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
There is a last discussion around the policy loop that the cron worker has to get through. The main issue is that the queries executing this loop are not great ~performance-wise, and as the loop grows (more policies to execute) we will hit an issue.
Note that #267546 (closed) has been opened to further discuss this and implement a solution. Right now, for MR 3, we're going to choose the best temporary solution, so that the MR can get merged and we can start working through the backlog of projects in need of a cleanup policy.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In review"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
This last discussion around the policies loop is giving us more surprises than expected. The database maintainer review revealed that the optimizer is having a hard time coming up with a good plan; it was suggested that I help it along.
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - ~"workflow::In review"
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged
feature flag enabled on gitlab.com. We are monitoring the situation for a few days to properly adjust the capacity setting. Currently, we're using the conservative value of 2.
Found a bug that would prevent some logs from appearing in Kibana. MR opened to fix it: !46903 (merged)
MRs
MR 1: Add a execution time limit to DeleteTagsService - !36319 (merged) - merged
MR 2: Add jids redis key support to CleanupContainerRepositoryWorker - !40393 (closed) - closed and merged in MR 3
MR 3: Add limits and throttling to Container Expiration Policies - !40740 (merged) - merged
!42598 (merged) - Add expiration_policy_started_at support in container repositories - merged
!36337 (merged) - Splitting DeleteTagsService in smaller services - merged