
Retry canceled registry migrations

Steve Abrams requested to merge 360732-canceled-migration-retries into master

🏠 Context

We are in the process of migrating all container repositories to the new container registry. This process is driven by a set of background workers in Rails. One worker, the GuardWorker, is responsible for checking whether any of the imports are taking too long. If they are, it takes action based on the status of the import and how long it has been stuck.

The GuardWorker asks the registry for the true status of the import. If the registry reports that it is still importing or pre-importing (meaning it really is stuck), the GuardWorker sends a cancel request to the registry and then aborts the import on the Rails side. When an import is aborted, the EnqueuerWorker picks it back up and tries it again. The problem is that the EnqueuerWorker also asks the registry for the true status of the import, which is now import_canceled. As the logic is currently written, when we see import_canceled we skip the import instead of retrying it. We want to keep retrying until we hit the retry limit imposed by the application settings (currently 3). This MR updates the logic to allow the retry.

This was the intention of #359300 (closed), but we got mixed up about when and how the import could end up in a canceled state.

This discussion may also be helpful in understanding how we arrived at these changes.

🛋 What does this MR do and why?

This MR changes the reconciliation logic so that when we receive import_canceled or pre_import_canceled from the container registry, we retry the import or pre-import instead of skipping.
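As a rough illustration of the change described above, the sketch below maps a status reported by the registry to the action taken on the Rails side. The constant, the method name, and the action symbols are hypothetical and are not taken from the actual diff. Only the two canceled statuses are modeled; every other status falls through to a placeholder.

```ruby
# Hypothetical sketch, not the code in this MR.
# Maps a registry-reported status to the action taken on the Rails side.
CANCELED_STATUS_ACTIONS = {
  'import_canceled' => :retry_import,
  'pre_import_canceled' => :retry_pre_import
}.freeze

# Returns the action for a canceled status, or :handled_elsewhere for any
# status this sketch does not model.
def reconcile_action(registry_status)
  CANCELED_STATUS_ACTIONS.fetch(registry_status, :handled_elsewhere)
end
```

Before this change, both canceled statuses would have mapped to a skip rather than a retry.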

If the import continues to get stuck, the GuardWorker will eventually skip it once we hit the retry limit, as expected.
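The retry limit can be pictured with the minimal sketch below. The `Migration` struct, the `handle_stuck_import` method, the state names, and the exact comparison against the limit are assumptions made for illustration; the limit of 3 mirrors the application setting value mentioned in the context section.

```ruby
# Hypothetical sketch of retry-limit handling, not the code in this MR.
MAX_RETRIES = 3 # assumption: mirrors the application setting (currently 3)

# Minimal stand-in for a container repository being migrated.
Migration = Struct.new(:retries, :state)

# Aborts a stuck import so it can be picked up again, or skips it once the
# retry limit has been reached.
def handle_stuck_import(migration, max_retries: MAX_RETRIES)
  if migration.retries >= max_retries
    migration.state = :import_skipped
  else
    migration.retries += 1
    migration.state = :import_aborted # eligible to be enqueued again
  end
  migration
end
```

Under these assumptions, an import that keeps getting stuck is aborted and retried three times, and is skipped on the fourth pass.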

No changelog is included because the entire process is behind a feature flag.

Screenshots or screen recordings

N/A

How to set up and validate locally

It is difficult to simulate this behavior locally.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related: #360732 (closed)

Edited by Steve Abrams
