Reduce Patroni sensitivity to transient Consul SerfCheck failures

Goal:

Tune or replace the Consul SerfCheck so that the Patroni leader does not lose its cluster_lock before its TTL expires. The SerfCheck currently detects when clients cannot reach a Patroni node; disabling it outright would mean Patroni no longer knows to fail over when its clients cannot reach it. However, on an unreliable network, transient SerfCheck failures have caused unwanted Patroni failovers.

The overall goal is to improve availability of the writable Postgres instance: avoid unnecessary Patroni failovers during very brief network disruptions, while still allowing failovers during disruptions lasting longer than a modest timeout (e.g. 30-60 seconds). Disabling SerfCheck implicitly favors one failover mode over another; SerfCheck works well on a reliable network, but it triggers failovers a little too aggressively on an unreliable network.
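
One candidate for the "tune" option is to slow down Serf's LAN failure detector so that a briefly unreachable node stays in the "suspect" state long enough to refute the suspicion before it is declared failed. Below is a rough sketch, assuming a Consul version that exposes the gossip_lan tunables and a conventional /etc/consul.d config directory; the values are illustrative placeholders, not validated recommendations, and would need to be checked against the 30-60 second budget above.

# Illustrative only: lengthen Serf LAN probe and suspicion timing so a briefly
# unreachable node is held in the "suspect" state longer before being declared
# failed (this is the window the Patroni leader has to refute the suspicion).
cat <<'EOF' | sudo tee /etc/consul.d/gossip-tuning.json
{
  "gossip_lan": {
    "probe_interval": "5s",
    "probe_timeout": "3s",
    "suspicion_mult": 8
  }
}
EOF
# Gossip tunables are read at agent start, so a config reload is probably not
# enough; restart the agent on a non-critical node first to verify behavior.
sudo systemctl restart consul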

Background:

In recent months, most Patroni failover events have been triggered by brief network connectivity disruptions. When the failover itself takes significantly longer to complete than the health-check failure takes to resolve, availability would have been higher had we waited a little longer before failing over: for example, a connectivity blip that resolves within seconds can trigger a failover that takes minutes to complete, turning a brief write stall into a much longer write outage.

Root cause analysis has identified three known ways that intermittent network disruptions have triggered Patroni failovers. This issue aims to mitigate Scenario C from the linked notes, copied below for convenience:

Scenario C:

Cause: Serf LAN health-check messages (UDP port 8301) are dropped in one or both directions on the network path to the Patroni leader from any other Consul agent, and the Consul agent on the Patroni leader is too slow to refute the resulting failure suspicion.

Effect: The Consul agent on a non-Patroni host declares suspicion that the Patroni leader has failed. The Patroni leader has a limited window in which to refute this suspicion, which it can learn about via gossip with other Consul agents. If it does not refute the suspicion promptly, the Consul server invalidates the Patroni leader's cluster_lock (even before its TTL expires), leaving the Patroni cluster leaderless. The Patroni replicas detect this and begin the Patroni failover procedure.
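
(Mechanism note, not part of the copied analysis: the cluster_lock is a Consul KV lock held by a Consul session, and Consul sessions include the node's serfHealth check by default, which is why the lock can be invalidated before its TTL expires. A minimal sketch against the standard Consul HTTP API, with illustrative names; the actual key path and session options are whatever Patroni configures.)

# Default behaviour: the session lists serfHealth, so a failed SerfCheck
# invalidates the session (and any lock it holds) even if the TTL has not
# expired yet.
curl -s -X PUT http://127.0.0.1:8500/v1/session/create \
  -d '{"Name": "patroni-example", "TTL": "30s", "Checks": ["serfHealth"]}'

# A session created with an empty check list depends only on its TTL (and on
# being renewed in time), not on serfHealth.
curl -s -X PUT http://127.0.0.1:8500/v1/session/create \
  -d '{"Name": "patroni-example-no-serf", "TTL": "30s", "Checks": []}'

# Either session ID can then hold a KV lock, the same pattern Patroni uses for
# its leader key (key path here is illustrative).
curl -s -X PUT -d "$(hostname)" \
  "http://127.0.0.1:8500/v1/kv/service/patroni-example/leader?acquire=<session-id>"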

Remedies:

...

Tuning or replacing the Serf check as a dependency of the Patroni cluster_lock (mentioned here but not yet ticketed) would reduce or eliminate the chance of Scenario C. If the SerfCheck is replaced, we must be careful in designing the new health check's failure modes. (For example, we must ensure that planned maintenance does not implicitly break the new health check, because doing so would trigger a Patroni failover -- the very thing we want to avoid when it is unnecessary.)
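
If we pursue the detach/replace route, the relevant knobs on the Patroni side appear to be the session's check list and the lock TTL. A minimal, unvalidated patroni.yml sketch follows, assuming our Patroni version supports the consul "checks" setting (to be confirmed against its documentation); the TTL and retry values are placeholders chosen to line up with the 30-60 second window from the goal above.

# Illustrative patroni.yml fragment -- verify option names against the Patroni
# version in use before relying on this.
consul:
  host: 127.0.0.1:8500
  checks: []          # do not tie the leader session to serfHealth; only the TTL governs lock expiry

bootstrap:
  dcs:
    ttl: 60           # seconds a silent leader may hold the cluster_lock
    loop_wait: 10
    retry_timeout: 30 # how long the leader retries DCS errors before demoting itself

Note that on an already-bootstrapped cluster the ttl/loop_wait/retry_timeout values live in the DCS itself and are changed with patronictl edit-config, not by editing the bootstrap section.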

Caveat:

Since completing the above analysis, we have learned more about the GCP networking infrastructure upon which our service stack currently runs. We still believe Consul was correctly detecting brief network connectivity outages.

We do not have long-term data, but it's worth noting that the frequency of network disruptions as detected by Consul SerfCheck appears to be lower in the last few weeks (see below) than it was in early September. (Note that some of the SerfCheck failures shown below are from reboots, many of which were caused by unplanned kernel panics, an unrelated but also interesting issue being tracked here.)

msmiley@web-35-sv-gprd.c.gitlab-production.internal:~$ ( ls -1tr /var/log/syslog*gz | xargs -r sudo zcat ; sudo cat /var/log/syslog{.1,} ) | egrep 'consul.[0-9]*.:.*2019/' | grep 'EventMemberFailed' | tee /tmp/results.out | wc -l
20

Consequently, Patroni has not triggered any failovers recently. If that trend persists, this issue for tuning or replacing SerfCheck becomes unnecessary. But we have no reason to believe the GCP network has become more reliable in the last few weeks.