Why are Patroni failovers occurring so often?
Problem statement:
Patroni failover events are expensive and are occurring much more frequently than expected. While a failover is in progress, the read-only replica databases remain available, but the writable primary database is unavailable. This causes all upstream clients to fail any task that requires interacting with the primary database. For most purposes, GitLab.com is effectively unavailable during this time.
Patroni's failover mechanism is crucial for maintaining high availability of our writable Postgres database, providing an efficient and reliable return to service when the writable instance fails or becomes unreachable by its many clients. However, unnecessary failover events harm availability (each typically causes 1-3 minutes of downtime) and require hours of manual clean-up and analysis.
Goal:
Reduce the rate of unnecessary failover events, to improve availability and avoid toil.
Discover what triggers the recent Patroni failover events, and propose options to avoid them without sacrificing too much ability to detect and respond to events that really do necessitate failover.
Non-goals:
Reducing the amount of toil associated with failover events is a separate and also desirable goal, but will not be addressed here, except for one point:
- The `statement_timeout` setting should not be applied to the Postgres user account used by `pg_rewind`. This consistently aborts the conversion of the old primary into a replica after failover. Making that one configuration change (sketched below) could avoid a significant amount of toil (i.e. replacing or rebuilding the old primary node as a fresh replica).
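As a rough sketch of that change: one way to exempt the rewind connection is a per-role override of `statement_timeout` for the Postgres role that `pg_rewind` connects as. The role name and connection string below are placeholders, not our actual configuration.

```python
# Minimal sketch: exempt the role used by pg_rewind from statement_timeout,
# so rewinding the old primary into a replica is not aborted part-way through.
# The DSN and role name are hypothetical placeholders.
import psycopg2

DSN = "host=127.0.0.1 dbname=postgres user=postgres"  # placeholder connection string
REWIND_ROLE = "rewind_user"                           # hypothetical: the role pg_rewind connects as

conn = psycopg2.connect(DSN)
conn.autocommit = True
with conn.cursor() as cur:
    # A per-role setting overrides the cluster-wide statement_timeout for this role only.
    cur.execute(f'ALTER ROLE "{REWIND_ROLE}" SET statement_timeout = 0;')
conn.close()
```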
Reducing the duration of downtime during a failover event is a separate and also desirable goal, but tuning that is not expected to yield significant improvement. The downtime duration consists of 3 phases:
- Failure detection time: The time between an actual failure and its detection is mainly affected by the health checks' frequency, timeouts, and scope. Tuning failure detection to be more aggressive can lead to a higher false-positive rate. That appears to be the case currently, so reducing our currently high false-positive rate may require increasing the time to detect actual failures. To the best of my knowledge, Patroni's failure detection time is currently at most 40 seconds (`loop_delay` + `ttl`); see the sketch after this list.
- Leader election: Patroni's leader election process includes a mandatory delay to let the replicas apply as much of the old primary's transactions as possible from the WAL stream. Then the freshest healthy replica is elected to become the new primary. The Postgres timeline is forked, and all other replicas are asked to switch to the new timeline and start consuming new transactions from the new primary.
- Reconvergence: Clients must reconnect to the new Postgres primary database. This time is already quite small because all clients connect to the writable primary Postgres instance via a proxy (`pgbouncer`). Only that handful of `pgbouncer` instances must actually reconnect to Patroni's new primary Postgres database.
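To make the failure-detection bound above concrete, here is a small sketch of the arithmetic. The values are Patroni's upstream defaults (`ttl: 30`, `loop_delay: 10`), used only for illustration; our production values may differ, but the worst case is bounded by their sum, which is where the 40-second figure comes from.

```python
# Sketch: worst-case failure detection time under Patroni's loop/TTL model.
# These are Patroni's upstream defaults, used here purely for illustration.
ttl = 30         # seconds the leader lock remains valid without a renewal
loop_delay = 10  # seconds between iterations of the Patroni HA loop

# The leader key can survive for up to `ttl` seconds after its last successful
# renewal, and other members notice the expired lock only on their next loop
# iteration, up to `loop_delay` seconds later.
worst_case_detection = ttl + loop_delay
print(f"worst-case failure detection time: {worst_case_detection} seconds")  # 40
```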
Background:
In the last couple of months, Patroni has initiated failover of the writable primary Postgres node several times. Most of those failovers appear to have been unnecessary, at least judging from GitLab.com's availability metrics prior to each failover.
Prior work:
For reference, here are some (but not necessarily all) of the Patroni failover events we investigated:
- 2019-07-17 failover from patroni-01 to patroni-04 and its RCA issue
- 2019-08-14 failover from patroni-01 to patroni-04 and its RCA issue
- 2019-08-27 failover from patroni-07 to patroni-10 and its RCA issue
- 2019-09-03 failover from patroni-10 to patroni-11 and its RCA issue
Several people have independently observed that Patroni failovers are triggered by timeouts during the Patroni agent's calls to its local Consul agent. Those timeouts are most often observed in Patroni's `get_cluster` method, which makes the first of the four REST calls the Patroni loop issues to the local Consul agent (a rough stand-in for that call follows the list below). What causes those timeouts is not yet clear, although several ideas have been proposed, including (but not limited to):
- ephemeral network packet loss, either in general or along the path of the Consul agent's connection to a Consul server
- kernel memory pressure delaying TCP receives
- Consul servers undergoing leader election
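For reference on where those timeouts surface: the call in question is an ordinary HTTP request from the Patroni process to the Consul agent on localhost. The snippet below is not Patroni's actual code; it is a rough stand-in showing a comparable recursive KV read against the local agent with a short client-side timeout, which is roughly where a transient network or Consul-side delay would show up as a `get_cluster` failure. The agent address, KV prefix, and timeout value are assumptions.

```python
# Rough stand-in for the kind of request Patroni's Consul integration makes each loop.
# Not Patroni's actual code; the KV prefix and timeout are illustrative assumptions.
import requests

CONSUL_AGENT = "http://127.0.0.1:8500"   # local Consul agent HTTP API
SCOPE_PREFIX = "service/pg-ha-cluster"   # hypothetical Patroni scope in the Consul KV store

try:
    # Read the cluster's keys (leader, members, config, ...) in one recursive KV query.
    resp = requests.get(
        f"{CONSUL_AGENT}/v1/kv/{SCOPE_PREFIX}/",
        params={"recurse": 1},
        timeout=2,  # short client-side timeout; a slow agent/server hop raises an exception
    )
    resp.raise_for_status()
    print(f"fetched {len(resp.json())} keys for the cluster")
except requests.exceptions.RequestException as exc:
    # Repeated failures like this are what can cascade into a lost leader lock in Patroni.
    print(f"get_cluster-style read failed: {exc}")
```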
At least one failover (on 2019-09-03) showed that around the time of the failover, the current Patroni leader lost its cluster lock -- its Consul "session" (mutex) was invalidated. This, too, can be a side effect of a brief interruption between the Consul agent and a Consul server, since the Consul session (which implements the Patroni cluster lock) automatically expires (unlocking the cluster) if not renewed within 15 seconds. Patroni only attempts to renew this session every 10 seconds, so a renewal delayed by more than 5 seconds leads to lock expiry, as the sketch below illustrates.
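A minimal sketch of that timing margin, using the figures above (a 15-second session TTL and renewal attempted every 10 seconds); the 6-second stall is just an example of the kind of delay discussed above:

```python
# Sketch: slack between the Consul session TTL and Patroni's renewal interval,
# using the figures cited above (15s TTL, renewal attempted every 10s).
session_ttl = 15     # seconds before an un-renewed Consul session is invalidated
renew_interval = 10  # seconds between Patroni's renewal attempts

slack = session_ttl - renew_interval
print(f"a renewal delayed by more than {slack} seconds expires the session")  # 5

# Example: a renewal attempt stalled for 6 seconds (packet loss, Consul leader
# election, ...) completes only after the session has already expired, so the
# cluster lock is released and a failover can be triggered.
stalled_renewal_completes_at = renew_interval + 6
print("session expired before renewal:", stalled_renewal_completes_at > session_ttl)  # True
```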