Investigation: Handle retrying failing deletes that result to infinite loops

added Category:Container Registry devopspackage maintenanceusability sectionops typemaintenance labels

added groupcontainer registry sectionci labels and removed sectionops label

marked this issue as related to #460274

mentioned in issue #460274

Thanks for raising this @adie.po!

@trizzi, can we get this scheduled?

17.3 is looking tight, but I'll try.

changed milestone to %17.3

changed milestone to %17.4

mentioned in issue #474274 (closed)

added Deliverable Engineering Time rails workflowrefinement labels

added Package:P1 label

Overview

The ContainerRegistry::DeleteContainerRepositoryWorker runs when there are container repositories that have the status delete_scheduled. Once picked up, the container repository's status is updated to delete_ongoing.

During deletion, failures can occur due to timeouts, network errors, permission issues, etc., and can be either transient or permanent.

When the worker fails, the container repository's status is set back to delete_scheduled so it can be picked up the next time a new worker instance runs. This works great for intermittent errors where the deletion can succeed in the next run. But for more permanent issues, the container repository fails every time leading to an endless loop on the workers.

Problem

For the case in #460274, the DeleteContainerRepositoryWorker runs with a container repository and fails when deleting tags due to permission errors. These errors are permanent and cannot be resolved with retries until the permissions are fixed

To illustrate, the following happens:

A new worker instance is started, Worker A
Worker A picks up the next container repository that has the status: delete_scheduled.
Worker A picked up Container Repository 1 and sets its status as delete_ongoing
Worker A attempts to delete the tags of Container Repository 1
The tag deletion fails and Worker A sets the status of ContainerRepository 1 back to delete_scheduled
Worker A successfully completes.
A new worker instance, Worker B, is started as there are still container repositories with status: delete_scheduled
Worker B then picks up the next container repository for deletion, and picks up Container Repository 1 again
Worker B does the same thing from Step 3 - 7, fails, sets the status back, and successfully completes.
A new instance gets started again and the loop continues.

The container repository is being set as delete_scheduled every time it fails, for an unlimited number of times until it succeeds. This leads to an endless loop of worker instances trying to delete the same container repository over and over, which can lead to resource exhaustion, as noted in #460274, where the CPU is maxed out with the worker being ran multiple times.

Proposed Solution

A. Set a max number of retries

We can set a maximum number of times we will try to delete a container repository. If it is exceeded, then we tag the container repository with a different status so we don't try to delete it anymore. Administrators can then refer to this status to know and be able to fix which container repository failed. With this, we don't just attempt to delete the container repository endlessly.

Add a field failed_deletion_count to ContainerRepository which we increase every time we attempted to delete the container repository and it fails.
Add a new status: deletion_failed_permanently to ContainerRepository. When picking up the container repository for deletion, if the failed_deletion_count has reached the MAX_FAILED_DELETION_COUNT, then we set the status to deletion_failed_permanently so it will not be picked up again.

It can be helpful to display this status somewhere so users can know that something has failed permanently and need further investigation.

Repositories marked as deletion_failed_permanently will require manual intervention to be reset to delete_scheduled. Once administrators have determined and fixed the problem, they can set the status of the container repository to delete_scheduled and it will be picked up again for deletion.

B. Introduce backoff strategy before attempting to delete again

In addition to the new field failed_deletion_count and new status deletion_failed_permanently in A, we also add a new field next_delete_attempt_at to use a backoff delay so the same repository is not picked up too frequently.

For example, the backoff delay can start with a base of 2 seconds and doubles with each failure (Time.now + 2 ^ failed_deletion_count seconds). The more failed deletes that happened, the more we increase the delays before the container repository is picked up again for deletion, providing time for transient issues to resolve.

When a container repository's deletion failed,

if failed_deletion_count has reached MAX_FAILED_DELETION_COUNT, then we change the status to deletion_failed_permanently, else:
we increment failed_deletion_count by 1
we set next_delete_attempt_at (i.e. Time.now + 2 ^ failed_deletion_count seconds)

And then when picking up the next container repository for deletion, we filter by those that have next_delete_attempt_at in the past. Filtering by next_delete_attempt_at could necessitate indexing this field to ensure that the worker's performance does not degrade as the number of container repositories grows.

Implementation Details:

For the first iteration, we can start with A (adding a counter to record failed deletion attempts, set a maximum number of times that deletion can fail, and set the status as deletion_failed_permanently when the max is reached. The max can either be a constant or implemented as an application setting.
Consult with UX / Frontend if we need to show if the container repository reached the status of deletion_failed_permanently.
Evaluate if we need to add an additional field, next_delete_attempt_at, to implement backoff delay.

This is for removing a repository DeleteContainerRepositoryWorker. The same plan can also be implemented for cleanup policies -- adding a new expiration_policy_cleanup_status that signifies that it has reached the max number of failures.

Requesting for your feedback please, when you have time @10io @jdrpereira! thanks!

@adie.po, thanks for the investigation and proposal! The plan sounds good for me. We use a similar strategy for online GC workers in the container registry which has been working well for us.

For B, we could not only have the initial backoff (e.g. 5s) but also a max backoff (e.g. 24h). If something does not resolve within 24h, then it's likely permanent/data related. Depending on the number of workers we have in parallel, we can also use a randomization factor to ease concurrency. However, if I recall, this is currently configured with a low number, so it's not super important if that's the case.

A without B will still lead to bursts of back-to-back failed executions until the max retries are exhausted. If complexity is not too high (likely not), adding the two would be ideal (I'm afraid we'll end up postponing B indefinitely if done separately).

Let's see what David thinks too.

Thanks @adie.po for the ping!

The plan sounds good.

However, I'm wondering about the UX side with the scenario of "moving a project".

If I recall correctly, some users reporting this problem were trying to move a project. Currently, that's not possible if there are any container registry images attached to it. Thus, they tried to remove the images first.

My question is: what do we do with images in deletion_failed_permanently? How can a user wanting to move the project, progress on that aspect?

The solutions I see are:

Let users retry the destruction. If a known hiccup is taking days to resolve, users might want to re-initiate the image removal.
Let users destroy the image on the rails side only.
- Put several s that this is a dangerous operation that can lead to bugs and security issues.
Keep as-is indefinitely.
- This doesn't sound great as we could be blocking users in what they want to do.

Depending on the number of workers we have in parallel

@jdrpereira

It's currently hardcoded to 2.

They are started by a cron worker that will check for work (if there is any image to destroy) every 5 minutes.

Allowing users to retry the destruction after exhausting all retries seems the safest option to me. If there is an unknown bug preventing the destruction we should (hopefully) become aware through user reports and fix it (instead of being in the dark about them).

The plan sounds great @adie.po. Thank you for putting this together! Definitely an improvement over infinite retries.

Once this change is in place and we know how often users are encountering deletion_failed_permanently we can decide if it's necessary to provide further details on how to delete the image.

assigned to @adie.po

A without B will still lead to bursts of back-to-back failed executions until the max retries are exhausted. If complexity is not too high (likely not), adding the two would be ideal (I'm afraid we'll end up postponing B indefinitely if done separately).

Thanks @jdrpereira, great point! Noted, then we can have the backoff already together with the max retries.

My question is: what do we do with images in deletion_failed_permanently? How can a user wanting to move the project, progress on that aspect?

Thanks for raising this, @10io, great catch! Thank you also for providing the different ways we can handle the UX of this. Agree with João that allowing users to retry after they have reached the deletion_failed_permanently state is the safest.

Great point, @crystalpoole! Yes, the backend tickets can be scheduled first giving us enough time to build the prerequisites as well as to get feedback on this.

Thank you all! I will create issues for the steps and also write an implementation proposal for the cleanup policies.

that allowing users to retry after they have reached the deletion_failed_permanently state is the safest.

@adie.po Is it possible to add the reason why the deletion_failed_permanently flag needs to be set? I'm afraid that allowing users to retry deletion without fixing the underlying problem could cause deletion to fail again.

Hi @rchanila! Of course

The deletion_failed_permanently flag is now implemented in !166119 (merged)as setting the status of the container repository to delete_failed.

When the container repository deletion has failed 10 times or more, we set its status to delete_failed. With this status, it will no longer be picked up by the background worker that runs regularly to delete container repositories tagged for deletion.

Having the delete_failed status should signal to the user that the container repository probably needs manual checking/fixing since it has failed a lot of times. Once fixed, users should be able to try the deletion again (via the backend, setting the status back to delete_scheduled) since their original intention was to delete the container repository. If the underlying problem is not fixed, then the deletion would retry and if it fails again until it reaches the threshold, then it will be marked delete_failed again.

For now, before #480653 is implemented, the way to reset the delete_status is to manually do it via the rails console -- setting ContainerRepository#status back to delete_scheduled.

We will add documentation regarding this in #499013 (closed).

Thanks for creating the issue! That would useful documentation until #480653 is implemented.

@adie.po Is there was a way to codify this in a delete_failed_reason column when the status delete_failed is set? We can then use this to show the reason in the UI. This is inline with our design guidelines. This implementation can be part of #480653 where we could show

Deletion failure along with reason.
Ability to retry said image.

Hi @rchanila, there can be multiple reasons why the container repository fails for more than 10 times. It can also be a different reason for every time. As of now, we don't store the reason why deletion fails and we also have a generic network error if it's due to network connectivity.

We can add a code snippet in the documentation to share what they can run to simulate the deletion that we do so they can try to replicate the error via the console for more information.

For now, would it be enough to show for now "failed more than x times" and then we can link to the documentation on how to troubleshoot?

we can link to the documentation on how to troubleshoot?

Yep, we can start off with this, thanks! I've updated the issue description.

marked this issue as related to #480651

mentioned in issue #480651

marked this issue as related to #480652 (closed)

marked this issue as related to #480653

This issue is scheduled for completion in this milestone but is not yet in development. Changing health status to 'needs attention'.

Issue participants are welcome to override this by setting the health status to another value.

changed health status to needs attention

added workflowin review label and removed workflowrefinement label

Async Updates

We have created 2 implementation issues to address the endless retries on DeleteContainerRepositoryWorker. They are:

Add max retry and backoff delay for container r... (#480652 - closed) that adds the max retries and backoff delays.
Allow users to retry deletion for container ima... (#480653) that explores allowing users to reset status in the UI.

We have also created an investigation issue for implementing max retries and backoff delay on cleanup policies Investigation: Handle retries on failing delete... (#480651).

With these, we can now close this issue and start working on the implementation. Thank you all for the feedback!

@trizzi, can we schedule the first implementation issue, #480652 (closed), to %17.5 if there's room left? thanks!

I'll try, thanks @adie.po!

closed

marked this issue as related to #499013 (closed)

Investigation: Handle retrying failing deletes that result to infinite loops

Designs

Child items ...

Activity

Overview

Problem

Proposed Solution

A. Set a max number of retries

B. Introduce backoff strategy before attempting to delete again

Implementation Details:

Async Updates

Investigation: Handle retrying failing deletes that result to infinite loops

Relates to

Activity

Overview

Problem

Proposed Solution

A. Set a max number of retries

B. Introduce backoff strategy before attempting to delete again

Implementation Details:

Async Updates