Recovering from non-terminal states and backend restarts
There are 4 non-terminal states:
- UPLOADING
- QUEUED
- RUNNING
- COMMITTING (cf. #165)
The following cases need close attention:
- if a failure occurs while a job is in one of those states, ensure that the backend moves the job out of that state.
- if the backend restarts, any job found in one of those states needs to be either discarded, or have its status checked and the proper state re-established. This is not trivial!
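For the first case, one possible approach is to wrap each non-terminal stage in a guard that moves the job to FAILED whenever the stage raises. The sketch below is only an illustration: `Job`, `fail_on_error`, `upload` and the `do_upload` callback are hypothetical names, not the backend's actual code, and the transition from UPLOADING to QUEUED on success is an assumption.

```python
from contextlib import contextmanager
from enum import Enum


class JobState(Enum):
    UPLOADING = "UPLOADING"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMMITTING = "COMMITTING"
    FAILED = "FAILED"


class Job:
    """Hypothetical in-memory job record; the real backend presumably persists it."""

    def __init__(self, job_id, state):
        self.job_id = job_id
        self.state = state
        self.error = None


@contextmanager
def fail_on_error(job):
    """Move the job to FAILED if the wrapped stage raises, then re-raise."""
    try:
        yield job
    except Exception as exc:
        job.state = JobState.FAILED
        job.error = repr(exc)
        raise


def upload(job, do_upload):
    """Run the (hypothetical) upload step without leaving the job stuck in UPLOADING."""
    job.state = JobState.UPLOADING
    with fail_on_error(job):
        do_upload(job)
    # Assumption: a successful upload is followed by QUEUED.
    job.state = JobState.QUEUED
```

The same guard could wrap the QUEUED, RUNNING and COMMITTING stages, so that no stage can exit through an exception while leaving the job in a non-terminal state.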
An example is RUNNING: when the backend restarts, it should check that the job exists in Pachyderm and is still running. If it does not find the job, its state should be changed to UNKNOWN. Another example is UPLOADING: it seems that an exception in the uploading code does not move the job to FAILED; the job just stays in UPLOADING.
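A restart-time reconciliation pass along these lines might look like the following sketch. It is an assumption-laden illustration, not the backend's implementation: `reconcile_on_restart` is a hypothetical function, jobs are represented as a plain dict, and `is_running_in_pachyderm` is a caller-supplied callback standing in for whatever Pachyderm lookup the backend uses. Marking every unverifiable job as UNKNOWN is a placeholder policy; QUEUED, UPLOADING and COMMITTING may each need their own check.

```python
from enum import Enum


class JobState(Enum):
    UPLOADING = "UPLOADING"
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMMITTING = "COMMITTING"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


NON_TERMINAL = {
    JobState.UPLOADING,
    JobState.QUEUED,
    JobState.RUNNING,
    JobState.COMMITTING,
}


def reconcile_on_restart(jobs, is_running_in_pachyderm):
    """Re-establish job states after a backend restart.

    jobs: dict mapping job id -> JobState (hypothetical storage format).
    is_running_in_pachyderm: hypothetical callback, job id -> bool.
    Returns a dict of the state changes that were applied.
    """
    changes = {}
    for job_id, state in jobs.items():
        if state not in NON_TERMINAL:
            continue  # terminal states are left untouched
        if state is JobState.RUNNING and is_running_in_pachyderm(job_id):
            continue  # verified: the job still exists and is running
        # Placeholder policy: anything that cannot be verified becomes UNKNOWN.
        changes[job_id] = JobState.UNKNOWN
    jobs.update(changes)
    return changes
```

For example, calling `reconcile_on_restart(jobs, lambda job_id: job_id in live_ids)` at startup, with `live_ids` being the set of job ids reported as running, would leave verified RUNNING jobs alone and flag the rest for inspection.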
Edited by mma227