Graceful BuildGrid Restart (unscheduled)

The following was raised by @edbaunton, as per this email.

Context

In the case of unscheduled downtime (e.g. a host crash), restarting BuildGrid has non-ideal downstream effects on its users: as I understand it, their jobs would come back as failed due to a gRPC disconnect. Short of requiring every RemoteEx client to have retry logic built in, or adding a proxy layer that applies the retry logic independently, I think we can enhance BuildGrid to handle downtime more readily. This might include serialising all state to a persistent store, or operating a hot backup that replicates state. It would probably also require the bots/workers to tolerate seemingly flaky connections and not bail immediately on server downtime. Equally, for the clients, would the connection be broken?
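To make the retry logic mentioned above concrete for discussion, here is a minimal sketch of retry with exponential backoff and jitter, as a client or bot might apply it while the server restarts. It is not BuildGrid code: `TransientDisconnect` and `call_with_retry` are hypothetical names, and the exception stands in for whatever error the transport raises on a dropped connection (for gRPC, presumably an error with status `UNAVAILABLE`).

```python
import random
import time


class TransientDisconnect(Exception):
    """Hypothetical stand-in for a dropped connection to the server."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5,
                    max_delay=30.0, sleep=time.sleep):
    """Call operation(), retrying on TransientDisconnect with backoff.

    The wait before each retry is drawn uniformly from [0, cap], where
    cap doubles per attempt up to max_delay ("full jitter"), so many
    bots reconnecting at once do not all hit the server together.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDisconnect:
            if attempt == max_attempts:
                # Retries exhausted: surface the failure to the caller.
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

A caller would wrap the remote call, for example `call_with_retry(lambda: fetch_work())`, where `fetch_work` is again a hypothetical placeholder. Retrying alone only hides the disconnect; for a job to survive the restart, the server still needs its state persisted or replicated, as described above.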

Task Description

Steps required TBC, following discussion.

Edited by Beth