Upgrade patroni to 2.0.x
While discussing the issues we are facing with our current Patroni implementation with @dbalexandre, we found indications that the most recent Patroni release improves on, and possibly fixes, many of them.
The first indication was https://github.com/zalando/patroni/issues/1752#issuecomment-725304723, where the suggested fix for the problem is:
Please upgrade to 2.0.1.
Changelog between 2.0.1 and 1.6.5
Here is the full changelog between 2.0.1 (latest version as of today) and 1.6.5 (what we ship today):
Version 2.0.1
New features
Use `more` as pager in `patronictl edit-config` if `less` is not available (Pavel Golub)
On Windows it would be the `more.com`. In addition to that, `cdiff` was changed to `ydiff` in `requirements.txt`, but `patronictl` still supports both for compatibility.
Added support of `raft` `bind_addr` and `password` (Alexander Kukushkin)
`raft.bind_addr` might be useful when running behind NAT. `raft.password` enables traffic encryption (requires the `cryptography` module).
Added `sslpassword` connection parameter support (Kostiantyn Nemchenko)
The connection parameter was introduced in PostgreSQL 13.
Stability improvements
Changed the behavior in pause (Alexander)
- Patroni will not call the `bootstrap` method if the `PGDATA` directory is missing/empty.
- Patroni will not exit on sysid mismatch in pause, only log a warning.
- The node will not try to grab the leader key in pause mode if Postgres is running not in recovery (accepting writes) but the sysid doesn't match with the initialize key.
Apply `master_start_timeout` when executing crash recovery (Alexander)
If Postgres crashed on the leader node, Patroni does a crash-recovery by starting Postgres in single-user mode. During the crash-recovery the leader lock is being updated. If the crash-recovery didn't finish in `master_start_timeout` seconds, Patroni will stop it forcefully and release the leader lock.
Removed the `secure` extra from the `urllib3` requirements (Alexander)
The only reason for adding it there was the `ipaddress` dependency for python 2.7.
Bugfixes
Fixed a bug in the `Kubernetes.update_leader()` (Alexander)
An unhandled exception was preventing demoting the primary when the update of the leader object failed.
Fixed hanging `patronictl` when RAFT is being used (Alexander)
When using `patronictl` with Patroni config, `self_addr` should be added to the `partner_addrs`.
Fixed bug in `get_guc_value()` (Alexander)
Patroni was failing to get the value of `restore_command` on PostgreSQL 12, therefore fetching missing WALs for `pg_rewind` didn't work.
Version 2.0.0
This version enhances compatibility with PostgreSQL 13, adds support of multiple synchronous standbys, has significant improvements in handling of `pg_rewind`, adds support of Etcd v3 and Patroni on pure RAFT (without Etcd, Consul, or Zookeeper), and makes it possible to optionally call the `pre_promote` (fencing) script.
PostgreSQL 13 support
Don't fire `on_reload` when promoting to `standby_leader` on PostgreSQL 13+ (Alexander Kukushkin)
When promoting to `standby_leader` we change `primary_conninfo`, update the role and reload Postgres. Since `on_role_change` and `on_reload` effectively duplicate each other, Patroni will call only `on_role_change`.
Added support for `gssencmode` and `channel_binding` connection parameters (Alexander)
PostgreSQL 12 introduced `gssencmode` and 13 `channel_binding` connection parameters and now they can be used if defined in the `postgresql.authentication` section.
Handle renaming of `wal_keep_segments` to `wal_keep_size` (Alexander)
In case of misconfiguration (`wal_keep_segments` on 13 and `wal_keep_size` on older versions) Patroni will automatically adjust the configuration.
Use `pg_rewind` with `--restore-target-wal` on 13 if possible (Alexander)
On PostgreSQL 13 Patroni checks if `restore_command` is configured and tells `pg_rewind` to use it.
New features
[BETA] Implemented support of Patroni on pure RAFT (Alexander)
This makes it possible to run Patroni without 3rd party dependencies, like Etcd, Consul, or Zookeeper. For HA you will have to run either three Patroni nodes or two nodes with Patroni and one node with `patroni_raft_controller`. For more information please check the documentation (`raft_settings`).
[BETA] Implemented support for Etcd v3 protocol via gRPC-gateway (Alexander)
Etcd 3.0 was released more than four years ago and Etcd 3.4 has v2 disabled by default. There are also chances that v2 will be completely removed from Etcd, therefore we implemented support of Etcd v3 in Patroni. In order to start using it you have to explicitly create the `etcd3` section in the Patroni configuration file.
Supporting multiple synchronous standbys (Krishna Sarabu)
It allows running a cluster with more than one synchronous replica. The maximum number of synchronous replicas is controlled by the new parameter `synchronous_node_count`. It is set to 1 by default and has no effect when the `synchronous_mode` is set to `off`.
Added possibility to call the `pre_promote` script (Sergey Dudoladov)
Unlike callbacks, the `pre_promote` script is called synchronously after acquiring the leader lock, but before promoting Postgres. If the script fails or exits with a non-zero exitcode, the current node will release the leader lock.
Added support for configuration directories (Floris van Nee)
YAML files in the directory are loaded and applied in alphabetical order.
Advanced validation of PostgreSQL parameters (Alexander)
In case the specific parameter is not supported by the current PostgreSQL version or when its value is incorrect, Patroni will remove the parameter completely or try to fix the value.
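To make the "remove the parameter completely" part concrete, here is a minimal sketch of version-aware filtering. It is not Patroni's validator: the two lookup tables and the function name `drop_unsupported` are assumptions for illustration, seeded only with the `wal_keep_segments`/`wal_keep_size` rename from this changelog, and major versions are simplified to plain integers.

```python
# Hypothetical lookup tables, NOT taken from Patroni's source.
# Major version in which a parameter first appeared / was removed.
INTRODUCED_IN = {"wal_keep_size": 13}
REMOVED_IN = {"wal_keep_segments": 13}


def drop_unsupported(params, major_version):
    """Return a copy of params without settings this version does not know."""
    kept = {}
    for name, value in params.items():
        if major_version < INTRODUCED_IN.get(name, 0):
            continue  # parameter does not exist yet on this version
        if major_version >= REMOVED_IN.get(name, float("inf")):
            continue  # parameter was removed in or before this version
        kept[name] = value
    return kept


if __name__ == "__main__":
    config = {"wal_keep_segments": 8, "max_connections": 100}
    print(drop_unsupported(config, 13))  # {'max_connections': 100}
```

Note that for the rename itself the changelog says Patroni adjusts the configuration rather than just dropping the setting, so the real logic does more than this sketch.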
Wake up the main thread when the forced checkpoint after promote completed (Alexander)
Replicas are waiting for checkpoint indication via member key of the leader in DCS. The key is normally updated only once per HA loop. Without waking the main thread up, replicas will have to wait up to `loop_wait` seconds longer than necessary.
Use of `pg_stat_wal_receiver` view on 9.6+ (Alexander)
The view contains up-to-date values of `primary_conninfo` and `primary_slot_name`, while the contents of `recovery.conf` could be stale.
Improved handling of IPv6 addresses in the Patroni config file (Mateusz Kowalski)
The IPv6 address is supposed to be enclosed in square brackets, but Patroni was expecting to get it plain. Now both formats are supported.
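Accepting both the bracketed and the plain IPv6 form could look roughly like the sketch below. The function name `parse_listen_address` and the rule "a plain IPv6 address carries no port" are assumptions made for this illustration, not Patroni's actual parsing code.

```python
def parse_listen_address(value, default_port):
    """Split 'host[:port]' into (host, port), accepting bracketed or plain IPv6."""
    value = value.strip()
    if value.startswith("["):
        # Bracketed IPv6, optionally followed by ':port', e.g. '[::1]:8008'.
        host, _, rest = value[1:].partition("]")
        port = rest[1:] if rest.startswith(":") else ""
    elif value.count(":") > 1:
        # Plain IPv6 without brackets: a trailing port would be ambiguous,
        # so this sketch assumes there is none.
        host, port = value, ""
    else:
        # Hostname or IPv4 address, optionally followed by ':port'.
        host, _, port = value.partition(":")
    return host, int(port) if port else default_port
```

For example, `parse_listen_address("[::1]:8008", 5432)` and `parse_listen_address("::1", 5432)` both yield the host `::1`, with ports 8008 and 5432 respectively.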
Added Consul `service_tags` configuration parameter (Robert Edström)
They are useful for dynamic service discovery, for example by load balancers.
Implemented SSL support for Zookeeper (Kostiantyn Nemchenko)
It requires `kazoo>=2.6.0`.
Implemented `no_params` option for custom bootstrap method (Kostiantyn)
It allows calling `wal-g`, `pgBackRest` and other backup tools without wrapping them into shell scripts.
Move WAL and tablespaces after a failed init (Feike Steenbergen)
When doing `reinit`, Patroni was already removing not only `PGDATA` but also the symlinked WAL directory and tablespaces. Now the `move_data_directory()` method will do a similar job, i.e. rename WAL directory and tablespaces and update symlinks in PGDATA.
Improvements in pg_rewind support
Improved timeline divergence check (Alexander)
We don't need to rewind when the replayed location on the replica is not ahead of the switchpoint or the end of the checkpoint record on the former primary is the same as the switchpoint. In order to get the end of the checkpoint record we use `pg_waldump` and parse its output.
Try to fetch missing WAL if `pg_rewind` complains about it (Alexander)
It could happen that the WAL segment required for `pg_rewind` doesn't exist in the `pg_wal` directory anymore and therefore `pg_rewind` can't find the checkpoint location before the divergence point. Starting from PostgreSQL 13 `pg_rewind` could use `restore_command` for fetching missing WALs. For older PostgreSQL versions Patroni parses the errors of a failed rewind attempt and tries to fetch the missing WAL by calling the `restore_command` on its own.
Detect a new timeline in the standby cluster and trigger rewind/reinitialize if necessary (Alexander)
The `standby_cluster` is decoupled from the primary cluster and therefore doesn't immediately know about leader elections and timeline switches. In order to detect the fact, the `standby_leader` periodically checks for new history files in `pg_wal`.
Shorten and beautify history log output (Alexander)
When Patroni is trying to figure out the necessity of `pg_rewind`, it could write the content of the history file from the primary into the log. The history file is growing with every failover/switchover and eventually starts taking up too many lines, most of which are not so useful. Instead of showing the raw data, Patroni will show only 3 lines before the current replica timeline and 2 lines after.
Improvements on K8s
Get rid of `kubernetes` python module (Alexander)
The official python kubernetes client contains a lot of auto-generated code and is therefore very heavy. Patroni uses only a small fraction of K8s API endpoints and implementing support for them wasn't hard.
Make it possible to bypass the `kubernetes` service (Alexander)
When running on K8s, Patroni is usually communicating with the K8s API via the `kubernetes` service, the address of which is exposed in the `KUBERNETES_SERVICE_HOST` environment variable. Like any other service, the `kubernetes` service is handled by `kube-proxy`, which in turn, depending on the configuration, is either relying on a userspace program or `iptables` for traffic routing. Skipping the intermediate component and connecting directly to the K8s master nodes allows us to implement a better retry strategy and mitigate risks of demoting Postgres when K8s master nodes are upgraded.
Sync HA loops of all pods of a Patroni cluster (Alexander)
Not doing so was increasing failure detection time from `ttl` to `ttl + loop_wait`.
Populate `references` and `nodename` in the subsets addresses on K8s (Alexander)
Some load-balancers are relying on this information.
Fix possible race conditions in the `update_leader()` (Alexander)
The concurrent update of the leader configmap or endpoint happening outside of Patroni might cause the `update_leader()` call to fail. In this case Patroni rechecks that the current node is still owning the leader lock and repeats the update.
Explicitly disallow patching non-existent config (Alexander)
For DCS other than `kubernetes` the PATCH call is failing with an exception due to `cluster.config` being `None`, but on Kubernetes it was happily creating the config annotation and preventing writing bootstrap configuration after the bootstrap finished.
Fix bug in `pause` (Alexander)
Replicas were removing `primary_conninfo` and restarting Postgres when the leader key was absent, but they should do nothing.
Improvements in REST API
Defer TLS handshake until worker thread has started (Alexander, Ben Harris)
If the TLS handshake was done in the API thread and the client-side didn't send any data, the API thread was blocked (risking DoS).
Check `basic-auth` independently from client certificate in REST API (Alexander)
Previously only the client certificate was validated. Doing two checks independently is an absolutely valid use-case.
Write double `CRLF` after HTTP headers of the `OPTIONS` request (Sergey Burladyan)
HAProxy was happy with a single `CRLF`, while Consul health-check complained about broken connection and unexpected EOF.
`GET /cluster` was showing stale members info for Zookeeper (Alexander)
The endpoint was using the Patroni internal cluster view. For Patroni itself it didn't cause any issues, but when exposed to the outside world we need to show up-to-date information, especially replication lag.
Fixed health-checks for standby cluster (Alexander)
The `GET /standby-leader` for a master and `GET /master` for a `standby_leader` were incorrectly responding with 200.
Implemented `DELETE /switchover` (Alexander)
The REST API call deletes the scheduled switchover.
Created `/readiness` and `/liveness` endpoints (Alexander)
They could be useful to eliminate "unhealthy" pods from subsets addresses when the K8s service is used with label selectors.
Enhanced `GET /replica` and `GET /async` REST API health-checks (Krishna, Alexander)
Checks now support optional keyword `?lag=<max-lag>` and will respond with 200 only if the lag is smaller than the supplied value. If relying on this feature please keep in mind that information about WAL position on the leader is updated only every `loop_wait` seconds!
Added support for user defined HTTP headers in the REST API response (Yogesh Sharma)
This feature might be useful if requests are made from a browser.
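For our own tooling, the lag-aware replica check described in this section could be consumed roughly as sketched below. The helper names are ours, and the sketch assumes the REST API is reachable over plain HTTP on port 8008 and that the lag limit is passed in bytes; adjust both for the real deployment.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode


def replica_check_url(host, max_lag_bytes=None, port=8008):
    """Build the URL of the replica health-check, optionally with a lag limit."""
    url = f"http://{host}:{port}/replica"
    if max_lag_bytes is not None:
        url += "?" + urlencode({"lag": max_lag_bytes})
    return url


def is_usable_replica(host, max_lag_bytes=None, port=8008, timeout=2.0):
    """Return True only if the endpoint answers with HTTP 200."""
    url = replica_check_url(host, max_lag_bytes, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx answers raise HTTPError (a URLError subclass); connection
        # problems raise URLError or OSError. All count as "not usable".
        return False
```

Because the leader's WAL position is refreshed only every `loop_wait` seconds, a limit much tighter than the WAL written during one loop will probably flap.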
Improvements in patronictl
Don't try to call non-existing leader in `patronictl pause` (Alexander)
While pausing a cluster without a leader on K8s, `patronictl` was showing warnings that member "None" could not be accessed.
Handle the case when member `conn_url` is missing (Alexander)
On K8s it is possible that the pod doesn't have the necessary annotations because Patroni is not yet running. It was making `patronictl` fail.
Added ability to print ASCII cluster topology (Maxim Fedotov, Alexander)
It is very useful to get an overview of the cluster with cascading replication.
Implement `patronictl flush switchover` (Alexander)
Before that `patronictl flush` only supported cancelling scheduled restarts.
Bugfixes
Attribute error during bootstrap of the cluster with existing PGDATA (Krishna)
When trying to create/update the `/history` key, Patroni was accessing the `ClusterConfig` object which wasn't created in DCS yet.
Improved exception handling in Consul (Alexander)
Unhandled exception in the `touch_member()` method caused the whole Patroni process to crash.
Enforce `synchronous_commit=local` for the `post_init` script (Alexander)
Patroni was already doing that when creating users (`replication`, `rewind`), but missing it in the case of `post_init` was an oversight. As a result, if the script wasn't doing it internally on its own the bootstrap in `synchronous_mode` wasn't able to finish.
Increased `maxsize` in the Consul pool manager (ponvenkates)
With the default `size=1` some warnings were generated.
Patroni was wrongly reporting Postgres as running (Alexander)
The state wasn't updated when for example Postgres crashed due to an out-of-disk error.
Put `*` into `pgpass` instead of missing or empty values (Alexander)
If for example the `standby_cluster.port` is not specified, the `pgpass` file was incorrectly generated.
Skip physical replication slot creation on the leader node with special characters (Krishna)
Patroni appeared to be creating a dormant slot (when `slots` defined) for the leader node when the name contained special chars such as '-' (e.g. "abc-us-1").
Avoid removing non-existent `pg_hba.conf` in the custom bootstrap (Krishna)
Patroni was failing if `pg_hba.conf` happened to be located outside of the `pgdata` dir after custom bootstrap.
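The `pgpass` fix listed in this section is easy to picture with a small sketch of how such a line can be generated. The function `pgpass_line` is hypothetical and not Patroni's code; it relies only on the documented `hostname:port:database:username:password` layout of the password file, where `*` matches anything in the first four fields and `:` or `\` inside a value must be escaped with a backslash.

```python
def pgpass_line(password, host=None, port=None, database=None, user=None):
    """Format one pgpass entry, using '*' for missing or empty match fields."""

    def escape(value):
        # Backslashes first, then colons, so added backslashes are not doubled.
        return str(value).replace("\\", "\\\\").replace(":", "\\:")

    def match_field(value):
        return "*" if value is None or value == "" else escape(value)

    fields = [match_field(v) for v in (host, port, database, user)]
    fields.append(escape(password))
    return ":".join(fields)
```

With no port given, `pgpass_line("s3cret", host="10.0.0.1", user="replicator")` produces `10.0.0.1:*:*:replicator:s3cret` instead of a line with an empty field.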
source: https://github.com/zalando/patroni/blob/master/docs/releases.rst#version-201
Highlights and comments
A few guesses at improvements that may fix issues we have observed before:
Changed the behavior in pause (Alexander)
Patroni will not call the `bootstrap` method if the `PGDATA` directory is missing/empty.
Patroni will not exit on sysid mismatch in pause, only log a warning.
The node will not try to grab the leader key in pause mode if Postgres is running not in recovery (accepting writes) but the sysid doesn't match with the initialize key.
We've seen some race conditions related to bootstrapping and Patroni joining an existing cluster.
When we see the cluster ID mismatch, it is extremely hard to fix and usually requires destroying a bunch of data in order to start from scratch.
Because a node no longer tries to grab the leader key in pause mode, we can bootstrap all nodes in advance and then select the one that should take the leader key. One possible reason to want this: when migrating an existing node that already holds data, that node should become the leader so it can replicate to the others.
Added possibility to call the `pre_promote` script (Sergey Dudoladov)
Unlike callbacks, the `pre_promote` script is called synchronously after acquiring the leader lock, but before promoting Postgres. If the script fails or exits with a non-zero exitcode, the current node will release the leader lock.
This can be super useful for our upgrading scripts. I believe we would probably rely on this to do the PG 11 to 12 upgrade.
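As a rough idea of what such a script could look like for our upgrade tooling, here is a minimal sketch. The marker file path and the rule "do not promote while an upgrade is in progress" are purely hypothetical; the only behavior taken from the changelog is that a non-zero exit code makes the node release the leader lock.

```python
import os
import sys

# Hypothetical marker an upgrade script would create while promotion must wait.
BLOCK_MARKER = "/var/run/postgresql/upgrade-in-progress"


def promotion_allowed(marker_path=BLOCK_MARKER):
    """Promotion is allowed unless the marker file exists."""
    return not os.path.exists(marker_path)


def main(marker_path=BLOCK_MARKER):
    """Return the exit code the pre_promote script should finish with."""
    if promotion_allowed(marker_path):
        return 0  # zero exit code: Patroni goes on to promote Postgres
    print("promotion blocked by " + marker_path, file=sys.stderr)
    return 1  # non-zero exit code: the node releases the leader lock


# When installed as the actual script, the file would end with:
#     raise SystemExit(main())
```

Keeping the decision in `promotion_allowed()` lets us test the rule without involving Patroni at all.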
Fixed health-checks for standby cluster (Alexander)
The `GET /standby-leader` for a master and `GET /master` for a `standby_leader` were incorrectly responding with 200.
We've seen some issues on standby clusters that differ from the ones on a regular cluster. Perhaps the health-checks were to blame?
Improved timeline divergence check (Alexander)
We don't need to rewind when the replayed location on the replica is not ahead of the switchpoint or the end of the checkpoint record on the former primary is the same as the switchpoint. In order to get the end of the checkpoint record we use `pg_waldump` and parse its output.
Try to fetch missing WAL if `pg_rewind` complains about it (Alexander)
It could happen that the WAL segment required for `pg_rewind` doesn't exist in the `pg_wal` directory anymore and therefore `pg_rewind` can't find the checkpoint location before the divergence point. Starting from PostgreSQL 13 `pg_rewind` could use `restore_command` for fetching missing WALs. For older PostgreSQL versions Patroni parses the errors of a failed rewind attempt and tries to fetch the missing WAL by calling the `restore_command` on its own.
Detect a new timeline in the standby cluster and trigger rewind/reinitialize if necessary (Alexander)
The `standby_cluster` is decoupled from the primary cluster and therefore doesn't immediately know about leader elections and timeline switches. In order to detect the fact, the `standby_leader` periodically checks for new history files in `pg_wal`.
I believe this chunk of improvements is what will fix many of the problems we are seeing when trying to bootstrap a patroni cluster
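To make the improved timeline divergence check easier to reason about, the two "no rewind needed" rules quoted above can be written down as a small decision function. This is our reading of the changelog text, not Patroni's implementation; the function names are made up, and LSNs are compared as integers parsed from the usual `X/Y` hexadecimal notation.

```python
def parse_lsn(text):
    """Convert a PostgreSQL LSN such as '0/3000060' into an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def needs_rewind(switchpoint, replayed=None, checkpoint_end=None):
    """Apply the two 'no rewind needed' rules from the changelog.

    replayed:       last replayed location if the node was a replica
    checkpoint_end: end of the checkpoint record if it was the primary
    Without either value the sketch conservatively asks for a rewind.
    """
    if replayed is not None and replayed <= switchpoint:
        return False  # the replica never got ahead of the switchpoint
    if checkpoint_end is not None and checkpoint_end == switchpoint:
        return False  # the former primary stopped exactly at the switchpoint
    return True
```

For instance, a replica that replayed up to `0/3000000` with a switchpoint at `0/3000060` would not need a rewind, while one that replayed up to `0/3000100` would.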