Add max retry and backoff delay for container repository deletes
⛰ Overview
To address endless retries when the deletion of a container repository fails, we want to introduce a maximum number of retries as well as a backoff delay between retries.
🚴 Implementation Details
A. Set a max number of retries
We can set a maximum number of times we will try to delete a container repository. If it is exceeded, we tag the container repository with a different status so we don't try to delete it anymore. Administrators can then use this status to identify which container repositories failed and fix them. This way, we no longer attempt to delete the container repository endlessly.
- Add a field `failed_deletion_count` to `ContainerRepository`, which we increase every time an attempt to delete the container repository fails.
- Add a new status `deletion_failed_permanently` to `ContainerRepository`. When picking up the container repository for deletion, if `failed_deletion_count` has reached `MAX_FAILED_DELETION_COUNT`, we set the status to `deletion_failed_permanently` so it will not be picked up again.
It can be helpful to display this status somewhere so users can know that something has failed permanently and needs further investigation.
Repositories marked as deletion_failed_permanently will require manual intervention to be reset to delete_scheduled. Once administrators have diagnosed and fixed the problem, they can set the status of the container repository back to delete_scheduled and it will be picked up again for deletion.
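As a rough illustration of the changes in A, here is a minimal sketch assuming the usual Rails migration and enum conventions; the constant value, column defaults, and enum mapping below are assumptions, not the actual GitLab implementation:

```ruby
# Hypothetical migration adding the retry counter (column defaults are illustrative).
class AddFailedDeletionCountToContainerRepositories < ActiveRecord::Migration[7.1]
  def change
    add_column :container_repositories, :failed_deletion_count, :integer, default: 0, null: false
  end
end

# Illustrative model additions; the real ContainerRepository may track its status differently.
class ContainerRepository < ApplicationRecord
  MAX_FAILED_DELETION_COUNT = 10 # assumed threshold, to be tuned

  enum status: {
    delete_scheduled: 0,            # existing: waiting to be deleted
    delete_failed: 1,               # existing: last attempt failed, will be retried
    deletion_failed_permanently: 2  # new: threshold reached, needs manual intervention
  }
end
```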
B. Introduce a backoff strategy before attempting to delete again
In addition to the new field failed_deletion_count and new status deletion_failed_permanently in A, we also add a new field next_delete_attempt_at to use a backoff delay so the same repository is not picked up too frequently.
For example, the backoff delay can start with a base of 2 seconds and double with each failure (Time.now + 2 ^ failed_deletion_count seconds). The more deletion attempts fail, the longer the delay before the container repository is picked up again for deletion, giving transient issues time to resolve.
We could have not only an initial backoff (e.g. 5s) but also a maximum backoff (e.g. 24h). If something does not resolve within 24 hours, it is likely a permanent or data-related issue.
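As a concrete sketch of that calculation (the cap constant and helper name are assumptions, and `.seconds`/`Time.zone` come from ActiveSupport in a Rails context):

```ruby
# Illustrative backoff calculation: exponential delay with a cap.
MAX_DELETION_BACKOFF = 24.hours # assumed maximum backoff

def next_delete_attempt_at(failed_deletion_count)
  delay = (2**failed_deletion_count).seconds # 2s, 4s, 8s, ... doubling per failure
  Time.zone.now + [delay, MAX_DELETION_BACKOFF].min
end

# failed_deletion_count = 3  => next attempt in ~8 seconds
# failed_deletion_count = 17 => 2**17 seconds (~36h) exceeds the cap, so ~24 hours
```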
When a container repository's deletion fails:
- if `failed_deletion_count` has reached `MAX_FAILED_DELETION_COUNT`, we change the status to `deletion_failed_permanently`; otherwise:
- we increment `failed_deletion_count` by 1
- we set `next_delete_attempt_at` (i.e. `Time.now` + 2 ^ `failed_deletion_count` seconds)
Then, when picking up the next container repository for deletion, we only consider those whose next_delete_attempt_at is in the past. Filtering by next_delete_attempt_at may require an index on this field to ensure that the worker's performance does not degrade as the number of container repositories grows.
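Putting A and B together, a hedged sketch of what the failure handling and the pick-up query could look like; the method and scope names here are hypothetical, not an existing GitLab API:

```ruby
class ContainerRepository < ApplicationRecord
  # Repositories due for another deletion attempt (never attempted, or backoff elapsed).
  # An index on next_delete_attempt_at (or status + next_delete_attempt_at) would likely
  # be needed, as noted above.
  scope :ready_for_delete_attempt, -> {
    delete_scheduled
      .where('next_delete_attempt_at IS NULL OR next_delete_attempt_at < ?', Time.zone.now)
  }

  # Hypothetical hook called by the cleanup worker when a deletion attempt fails.
  def track_failed_deletion!
    if failed_deletion_count >= MAX_FAILED_DELETION_COUNT
      update!(status: :deletion_failed_permanently)
    else
      new_count = failed_deletion_count + 1
      update!(
        failed_deletion_count: new_count,
        next_delete_attempt_at: Time.zone.now + (2**new_count).seconds
      )
    end
  end
end
```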
☀ Next Steps
When this issue is closed, container repositories that failed deletion multiple times and reached the threshold will be in the deletion_failed_permanently state. They will require manual intervention to be reset to delete_scheduled so the system can attempt to delete them again.
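For reference, the manual reset could be as simple as the following Rails console snippet (illustrative only; an admin runbook or UI may wrap it):

```ruby
# Rails console: reset a permanently-failed repository so deletion is retried (illustrative).
# container_repository_id is the ID of the affected repository.
repository = ContainerRepository.find(container_repository_id)
repository.update!(
  status: :delete_scheduled,
  failed_deletion_count: 0,
  next_delete_attempt_at: nil
)
```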
If there is a need for a UI to reset the container repository to delete_scheduled, we can investigate and implement that in the follow-up issue #480653.