Incorrect error semantics cause transactions to hang in failure scenarios
Several RPCs in Gitaly have weird semantics when it comes to error returns. Many of the Operations
service's RPCs, like UserCommitFiles
, return an error message as a string in the response body with status OK. This can cause transactions to hang around for unnecessarily long if the primary fails the RPC. As Praefect sees the primary returning with a success, it doesn't cancel the secondary streams. This leaves the secondaries stuck waiting for the primary to vote until finally timing out around ~50s in.
It might explain some deadline exceeded errors we see, and isn't generally a nice experience for the user.
With !3494 (merged) this actually causes incorrect results. If the primary fails after receiving a commit, Praefect doesn't receive a proper error code to indicate the primary may not have persisted its changes. The hook errors are returned as strings in the response body.
If Praefect received a proper error code, it would cancel the proyxing and thus propagate cancellation to secondaries. We should change these RPCs to return proper error codes instead of error strings in the response.
/cc @zj-gitlab @mjwood