Add max retry and backoff delay for container repository deletes
⛰ Overview
To address endless retries when the deletion of a container repository fails, we want to introduce a maximum number of retries as well as a backoff delay between retries.
🚴 Implementation Details
A. Set a max number of retries
We can set a maximum number of times we will try to delete a container repository. If it is exceeded, we tag the container repository with a different status so we don't try to delete it anymore. Administrators can then use this status to identify which container repositories failed and fix them. This way, we no longer attempt to delete the container repository endlessly.
- Add a field `failed_deletion_count` to `ContainerRepository`, which we increase every time an attempt to delete the container repository fails.
- Add a new status `deletion_failed_permanently` to `ContainerRepository`. When picking up the container repository for deletion, if `failed_deletion_count` has reached `MAX_FAILED_DELETION_COUNT`, we set the status to `deletion_failed_permanently` so it will not be picked up again.
It can be helpful to display this status somewhere so users can know that something has failed permanently and needs further investigation.
Repositories marked as deletion_failed_permanently will require manual intervention to be reset to delete_scheduled. Once administrators have diagnosed and fixed the problem, they can set the status of the container repository back to delete_scheduled and it will be picked up again for deletion.
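As a rough illustration of the changes in A, here is a minimal sketch assuming the usual Rails migration and enum conventions; the constant value, column defaults, and enum mapping below are assumptions, not the actual GitLab implementation:

```ruby
# Hypothetical migration adding the retry counter (column defaults are illustrative).
class AddFailedDeletionCountToContainerRepositories < ActiveRecord::Migration[7.1]
  def change
    add_column :container_repositories, :failed_deletion_count, :integer, default: 0, null: false
  end
end

# Illustrative model additions; the real ContainerRepository may track its status differently.
class ContainerRepository < ApplicationRecord
  MAX_FAILED_DELETION_COUNT = 10 # assumed threshold, to be tuned

  enum status: {
    delete_scheduled: 0,            # existing: waiting to be deleted
    delete_failed: 1,               # existing: last attempt failed, will be retried
    deletion_failed_permanently: 2  # new: threshold reached, needs manual intervention
  }
end
```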
B. Introduce a backoff strategy before attempting to delete again
In addition to the new field failed_deletion_count and new status deletion_failed_permanently in A, we also add a new field next_delete_attempt_at to use a backoff delay so the same repository is not picked up too frequently.
For example, the backoff delay can start with a base of 2 seconds and double with each failure (Time.now + 2 ^ failed_deletion_count seconds). The more deletion attempts fail, the longer the delay before the container repository is picked up again for deletion, giving transient issues time to resolve.
We could have not only an initial backoff (e.g. 5s) but also a maximum backoff (e.g. 24h). If something does not resolve within 24 hours, it is likely a permanent or data-related issue.
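As a concrete sketch of that calculation (the cap constant and helper name are assumptions, and `.seconds`/`Time.zone` come from ActiveSupport in a Rails context):

```ruby
# Illustrative backoff calculation: exponential delay with a cap.
MAX_DELETION_BACKOFF = 24.hours # assumed maximum backoff

def next_delete_attempt_at(failed_deletion_count)
  delay = (2**failed_deletion_count).seconds # 2s, 4s, 8s, ... doubling per failure
  Time.zone.now + [delay, MAX_DELETION_BACKOFF].min
end

# failed_deletion_count = 3  => next attempt in ~8 seconds
# failed_deletion_count = 17 => 2**17 seconds (~36h) exceeds the cap, so ~24 hours
```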
When a container repository's deletion fails:
- if `failed_deletion_count` has reached `MAX_FAILED_DELETION_COUNT`, we change the status to `deletion_failed_permanently`; otherwise:
- we increment `failed_deletion_count` by 1
- we set `next_delete_attempt_at` (i.e. `Time.now` + 2 ^ `failed_deletion_count` seconds)
Then, when picking up the next container repository for deletion, we only consider those whose next_delete_attempt_at is in the past. Filtering by next_delete_attempt_at may require an index on this field to ensure that the worker's performance does not degrade as the number of container repositories grows.
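Putting A and B together, a hedged sketch of what the failure handling and the pick-up query could look like; the method and scope names here are hypothetical, not an existing GitLab API:

```ruby
class ContainerRepository < ApplicationRecord
  # Repositories due for another deletion attempt (never attempted, or backoff elapsed).
  # An index on next_delete_attempt_at (or status + next_delete_attempt_at) would likely
  # be needed, as noted above.
  scope :ready_for_delete_attempt, -> {
    delete_scheduled
      .where('next_delete_attempt_at IS NULL OR next_delete_attempt_at < ?', Time.zone.now)
  }

  # Hypothetical hook called by the cleanup worker when a deletion attempt fails.
  def track_failed_deletion!
    if failed_deletion_count >= MAX_FAILED_DELETION_COUNT
      update!(status: :deletion_failed_permanently)
    else
      new_count = failed_deletion_count + 1
      update!(
        failed_deletion_count: new_count,
        next_delete_attempt_at: Time.zone.now + (2**new_count).seconds
      )
    end
  end
end
```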
☀ Next Steps
When this issue is closed, container repositories that failed deletion multiple times and reached the threshold will be in the deletion_failed_permanently state. They will require manual intervention to be reset to delete_scheduled so the system can attempt to delete them again.
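For reference, the manual reset could be as simple as the following Rails console snippet (illustrative only; an admin runbook or UI may wrap it):

```ruby
# Rails console: reset a permanently-failed repository so deletion is retried (illustrative).
# container_repository_id is the ID of the affected repository.
repository = ContainerRepository.find(container_repository_id)
repository.update!(
  status: :delete_scheduled,
  failed_deletion_count: 0,
  next_delete_attempt_at: nil
)
```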
If there is a need for a UI to reset the container repository to delete_scheduled, we can investigate and implement that in the follow-up issue #480653.