Geo: Orphaned uploads lead to "Sync timed out after 28800"

Orphaned uploads appear to lead to "Sync timed out after 28800" (28800 seconds is 8 hours). This does not seem like an appropriate failure mode.

Example: #417164 (closed)

  1. The timeout implies that the sync jobs are exiting without updating state (or worse, hanging forever?).
  2. An orphaned upload is a relatively common occurrence, and it is easy to detect on a secondary site without even making a request to the primary site.
  3. Each such sync counts against the concurrency limit for 8 hours.

In this case, the secondary should:

  • When attempting to sync a model_record, check whether lost_orphan? is true for it. (I added "lost" because it is technically possible for a model_record to know where its data is without its parent; that depends on how the path is implemented.)
  • Treat it as a failure immediately without attempting a request for the resource.
  • Log a descriptive failure message, e.g. "Upload with ID X is orphaned. Model with class Y and ID Z does not exist in the database."
  • Set the retry_at to a relatively long time from now, e.g. 24 hours, since it is a data integrity problem that will almost always persist forever in the PG database.
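
The four steps above could be sketched roughly as follows. This is a minimal, standalone illustration in plain Ruby, not the actual Geo code: `Upload`, `Registry`, `sync`, `ORPHAN_RETRY_DELAY`, and the `parent_exists` flag are all hypothetical names made up for the example, and the real orphan check would depend on how the model and its path are implemented.

```ruby
require "logger"

# Assumption: 24 hours, taken from the suggestion above.
ORPHAN_RETRY_DELAY = 24 * 60 * 60 # seconds

# Hypothetical stand-in for a model_record. `parent_exists` fakes the
# "does the parent model still exist in the database?" lookup.
Upload = Struct.new(:id, :model_class, :model_id, :parent_exists) do
  def lost_orphan?
    !parent_exists
  end
end

# Hypothetical stand-in for the secondary's sync state.
Registry = Struct.new(:state, :last_sync_failure, :retry_at)

# Fails fast for orphans; otherwise calls the block, which stands in for
# the request to the primary site.
def sync(upload, registry, logger:, now: Time.now, &download)
  if upload.lost_orphan?
    message = "Upload with ID #{upload.id} is orphaned. " \
              "Model with class #{upload.model_class} and ID #{upload.model_id} " \
              "does not exist in the database."
    logger.error(message)
    registry.state = :failed
    registry.last_sync_failure = message
    registry.retry_at = now + ORPHAN_RETRY_DELAY
    return false
  end

  download.call(upload)
  registry.state = :synced
  registry.last_sync_failure = nil
  registry.retry_at = nil
  true
end

# Usage: an orphaned upload is failed immediately and the block never runs.
registry = Registry.new(:pending, nil, nil)
orphan = Upload.new(42, "Project", 7, false)
sync(orphan, registry, logger: Logger.new($stderr)) { |_upload| raise "not reached" }
```

Because the orphan branch returns before any request is made, the sync fails in well under a second instead of holding its place for 8 hours, and the long `retry_at` keeps it from being rescheduled right away.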
Edited by Michael Kozono