Graceful BuildGrid Restart (unscheduled)
The following was raised by @edbaunton, as per this email.
Context
In the case of unscheduled downtime (e.g. a host crash), restarting BuildGrid has non-ideal downstream effects on its users: as I understand it, their jobs would come back as failed due to a gRPC disconnect. Short of requiring all RemoteEx clients to have retry logic built in, or adding a proxy layer that applies the retry logic independently, I think we can enhance BuildGrid to handle downtime more readily. This might include serialising all state to a persistent store, or operating a hot backup which replicates state. It would probably also require the bots/workers to tolerate seemingly flaky connections and not bail out instantly when the server goes down. Would the connection equally be broken for the clients?
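To make the "retry logic" option concrete for discussion, here is a minimal sketch of what a bot or client might do to ride out a short server outage. It is not BuildGrid code: `ServerUnavailable` and `call_with_retry` are hypothetical names, and the exception stands in for whatever a real client would see on disconnect (presumably a `grpc.RpcError` with status `UNAVAILABLE`). The attempt count and delays are placeholder values.

```python
import random
import time


class ServerUnavailable(Exception):
    """Hypothetical stand-in for a dropped connection (e.g. gRPC UNAVAILABLE)."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    """Call operation(), retrying with exponential backoff while the server is down."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServerUnavailable:
            if attempt == max_attempts:
                raise  # give up: the downtime outlasted the retry budget
            # Exponential backoff capped at max_delay, with jitter so that many
            # bots do not all reconnect at the same instant after a restart.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Retrying only hides the restart if the server still knows about the in-flight job when it comes back, which is why the persistent store or hot backup would be needed alongside anything like this.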
Task Description
Steps required TBC, following discussion.