Users receive intermittent failures when only one Praefect node cannot reach Gitaly

When one Praefect node considers the Gitaly nodes unhealthy but the other Praefect nodes do not, requests routed to the affected node may fail because that Praefect has no healthy Gitaly node to route to.
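
To illustrate the failure mode, here is a minimal Go sketch (hypothetical types, not Praefect's actual router code): each Praefect routes based only on its own locally observed health state, so a single node with a stale or empty healthy set fails requests even while the rest of the cluster sees healthy Gitaly nodes.

```go
package main

import (
	"errors"
	"fmt"
)

// localRouter is a hypothetical stand-in for a Praefect node's router.
// Each Praefect keeps its *own* view of Gitaly node health, populated by
// its own health checks, and routes only within that view.
type localRouter struct {
	primary string
	healthy map[string]bool
}

// routePrimary fails when this Praefect considers the primary unhealthy,
// even if every other Praefect in the cluster still sees it as healthy.
func (r *localRouter) routePrimary() (string, error) {
	if !r.healthy[r.primary] {
		return "", errors.New("no healthy nodes: primary is not healthy")
	}
	return r.primary, nil
}

func main() {
	// This node's health checks have all failed, so its healthy set is empty.
	r := &localRouter{primary: "gitaly-1", healthy: map[string]bool{}}
	if _, err := r.routePrimary(); err != nil {
		fmt.Println(err) // requests through this node fail; other nodes are fine
	}
}
```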

Here are the specifics of a recent partial outage a customer experienced with this scenario:

One of their three Praefect nodes intermittently experienced failed health check calls to all three Gitaly nodes:

```json
{
  "address": "tcp://10.0.0.1:8075",
  "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded",
  "level": "warning",
  "msg": "error when pinging healthcheck",
  "pid": 1721,
  "storage": "gitaly-1",
  "time": "2021-03-25T09:49:42.531Z",
  "virtual_storage": "cluster"
}
```
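
For context, a minimal sketch of the kind of health-check ping behind this log line, assuming the standard gRPC health-checking protocol and a short per-check deadline (the address and timeout below are illustrative). If Gitaly does not answer before the deadline, the call returns `DeadlineExceeded`, as seen above.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func pingGitaly(addr string, timeout time.Duration) error {
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close()

	// The deadline bounds the whole check; a slow or unreachable Gitaly
	// surfaces as "context deadline exceeded" rather than hanging forever.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	_, err = healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	return err
}

func main() {
	if err := pingGitaly("10.0.0.1:8075", time.Second); err != nil {
		log.Printf("error when pinging healthcheck: %v", err)
	}
}
```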

Neither of the other Praefect nodes experienced this, and Gitaly logs show that they were receiving health checks from all three Praefects at this time and responding quickly. No other errors were logged by the problem node, so it's unclear why these requests failed. This is the same customer as #3541 (closed), although the problem node was not the one that OOMed, so perhaps memory pressure was a factor.

The problem Praefect node was apparently still able to connect to Postgres and was otherwise healthy. As a result, requests routed to it would intermittently fail with `no healthy nodes: primary is not healthy`, causing significant user impact.

/cc @samihiltunen
