Update registry migration guard logic
Problem
During the production rollout, a pre-import got stuck. After 30 minutes it was canceled and skipped, as expected. This raised a few questions:
- Do we want to skip in this case or retry a few more times?
- When we have a canceled migration, how can we tell whether Rails or the registry canceled it?
- We probably don't need to wait 30 minutes.
- We should be re-enqueueing after a container repository is skipped to keep the migration running.
Solution
To address these questions we will:
- Re-enqueue after a container repository is skipped.
- Add a new skipped state: `migration_canceled_registry`. This will only be used if the cancelation comes from the registry side.
- Update the `long_running_migration_threshold` to 10 minutes.
- If an import or pre-import is canceled by Rails, allow it to retry until it hits the max retries. In this case, we need to cancel from Rails before retrying, so some logic updates will be needed in the state machine.
- Add a feature flag for a capacity of 2 so we have less of a jump from 1 to 10.
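
The decision flow described in the list above can be sketched in plain Ruby. This is an illustrative sketch only, not the actual guard implementation: only `migration_canceled_registry` and the 10-minute `long_running_migration_threshold` come from this description. The `Repository` struct, the `guard_action` method, the `MAX_RETRIES` value, the `:too_many_retries` reason, and the `registry_status` field are all hypothetical names chosen for the example.

```ruby
# Illustrative sketch. Names other than `migration_canceled_registry` and the
# 10-minute threshold are assumptions, not the real GitLab code.
LONG_RUNNING_MIGRATION_THRESHOLD = 10 * 60 # seconds (10 minutes, per the description)
MAX_RETRIES = 3 # assumed value; the description only says "max retries"

# Hypothetical stand-in for a container repository under migration.
Repository = Struct.new(:started_at, :retries, :registry_status, keyword_init: true)

# Decides what the guard should do with an in-flight import or pre-import.
# Returns one of:
#   [:skip, :migration_canceled_registry]  the registry canceled the migration
#   [:wait]                                still under the long-running threshold
#   [:skip, :too_many_retries]             Rails retries are exhausted (assumed reason name)
#   [:cancel_and_retry]                    Rails cancels first, then retries
def guard_action(repo, now: Time.now)
  # A cancelation that originates from the registry gets its own skip reason.
  return [:skip, :migration_canceled_registry] if repo.registry_status == :canceled

  # Not stuck yet: leave the migration alone.
  return [:wait] if now - repo.started_at < LONG_RUNNING_MIGRATION_THRESHOLD

  # Stuck and out of retries: skip so the next repository can be enqueued.
  return [:skip, :too_many_retries] if repo.retries >= MAX_RETRIES

  # Stuck with retries left: cancel from Rails before retrying.
  [:cancel_and_retry]
end
```

In this sketch, any `:skip` result is the point at which the caller would re-enqueue the next container repository, so that a skipped repository does not stall the overall migration.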
Edited by Steve Abrams