Revisit Retry Behavior for Background Migrations When Function Is Not Found
Context
Currently, when a background migration encounters the error "work function not found", it is marked as failed. This behavior assumes a permanent misconfiguration. During a rolling upgrade, however, the condition can be temporary:
- The migration entry exists in the database.
- Some pods are still running older versions of the registry that do not include the migration function.
- Newer pods (once deployed) will be capable of successfully running the migration.
Problem
Marking the migration as failed immediately in these situations forces an operator to reset it manually, which adds friction during deployments and upgrades.
Proposal
Evaluate whether the system should retry the migration a limited number of times or defer execution when the migration function is not yet available, rather than failing immediately.
Possible Solutions:
- Introduce a retry strategy for "work function not found" errors.
- Allow for a configurable backoff or retry window before marking as failed.
Benefits:
- Removes need for manual resets of background migrations.
- Improves reliability during rolling upgrades.
- Makes migration system more resilient to transient inconsistencies.
Edited by SAhmed