Remote-execution connection recovery (technical debt)

Background

The initial remote-execution implementation (!626 (merged)) does not handle gRPC connection failures during long-running Operation execution, nor does it try to catch up with Operation execution states after a potential reconnection.

The REAPI declares a WaitExecution() call that should help reopen an Operation stream given an Operation name.

Task description

Implementation should include:

  • Handle network failures while polling Operation status.
  • Try to reconnect when such a failure happens.
  • Resume Operation status polling if reconnection succeeds.
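
The three points above could be sketched roughly as follows. This is only an illustration, not the proposed implementation: follow_operation, ConnectionLost, and the execute / wait_execution callables are hypothetical stand-ins for wrappers around the REAPI Execute() and WaitExecution() calls, and Operation updates are modelled as plain dicts with "name" and "done" keys. In real code the caught exception would presumably be grpc.RpcError with a transient status code such as grpc.StatusCode.UNAVAILABLE.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical stand-in for a transient gRPC failure."""


def follow_operation(execute, wait_execution, max_retries=3, backoff=1.0):
    """Follow an Operation stream until it is done, reopening it on failure.

    execute()            -> iterator of updates (assumed wrapper around Execute())
    wait_execution(name) -> iterator of updates (assumed wrapper around WaitExecution())
    """
    name = None      # Operation name, known once the first update arrives
    failures = 0     # consecutive failures since the last successful update

    while True:
        try:
            # First pass starts the execution; later passes reopen the stream.
            stream = execute() if name is None else wait_execution(name)
            for update in stream:
                name = update["name"]
                failures = 0
                if update["done"]:
                    return update
            # The stream ended without a final state: treat it as a failure.
            raise ConnectionLost("stream closed before the Operation completed")
        except ConnectionLost:
            failures += 1
            # Without an Operation name there is nothing to resume.
            if name is None or failures > max_retries:
                raise
            # Exponential backoff before trying to reconnect.
            time.sleep(backoff * 2 ** (failures - 1))
```

Resetting the failure counter on every received update means max_retries bounds consecutive reconnection attempts rather than the total over the lifetime of a long-running Operation.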
Edited by Martin Blanchard