Revisit Retry Behavior for Background Migrations When Function Is Not Found

Context

Currently, when a background migration encounters the error work function not found, it is marked as failed. This behavior assumes a permanent misconfiguration. However, in scenarios such as rolling upgrades, this may be a temporary state where:

  • The migration entry exists in the database.
  • Some pods are still running older versions of the registry that do not include the migration function.
  • Newer pods (once deployed) will be capable of successfully running the migration.

Problem

Marking the migration as failed immediately in these situations can lead to manual intervention and adds friction during deployments or upgrades.

Proposal

Evaluate whether the system should retry the migration a limited number of times or defer execution when the migration function is not yet available, rather than failing immediately.

Possible Solutions:

  • Introduce a retry strategy for work function not found errors.
    • Allow for a configurable backoff or retry window before marking as failed.

Benefits:

  • Removes need for manual resets of background migrations.
  • Improves reliability during rolling upgrades.
  • Makes migration system more resilient to transient inconsistencies.
Edited by SAhmed