Patroni: Consider enabling pg_rewind flag by default
Summary
When an instance gets out of sync with the others, because of a leader promotion/switchover or a split-brain, it can be hard to recover from that state.
An inconsistent state may look like the following in the Patroni logs:
2020-10-26_12:53:21.27083 FATAL: could not start WAL streaming: ERROR: replication slot "gabriel_patroni_secondary_geo_patroni_2_c_group_geo_f9c951_inte" does not exist
2020-10-26_12:53:26.27482 NOTICE: identifier "gabriel-patroni-secondary-geo-patroni-2.c.group-geo-f9c951.internal" will be truncated to "gabriel-patroni-secondary-geo-patroni-2.c.group-geo-f9c951.inte"
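The slot name in the FATAL line appears to be derived from the member name in the NOTICE line. A minimal sketch of that mapping follows. It assumes Patroni lowercases the member name, replaces "-" and "." with "_", and cuts the result to PostgreSQL's 63-byte identifier limit. The function name slot_name_sketch is hypothetical, and the sketch ignores any other special characters Patroni may handle.

```shell
#!/bin/sh
# Hypothetical helper, not part of Patroni or Omnibus GitLab.
# Assumption: slot name = lowercased member name, with '-' and '.'
# replaced by '_', truncated to 63 characters (PostgreSQL identifier limit).
slot_name_sketch() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | tr '.-' '__' \
    | cut -c1-63
}

slot_name_sketch 'gabriel-patroni-secondary-geo-patroni-2.c.group-geo-f9c951.internal'
# prints: gabriel_patroni_secondary_geo_patroni_2_c_group_geo_f9c951_inte
```

The output matches the slot name that the replica fails to find, so the name itself looks correct; the slot is simply missing on the node the replica is streaming from.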
If you run gitlab-ctl patroni members, the misbehaving node will not be listed. If you check Consul with sudo /opt/gitlab/embedded/bin/consul members, the node still appears there, which indicates that only Patroni is not accepting the node.
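The two checks can be compared side by side. The following is a rough sketch only: the helper name listed and the NODE value are hypothetical, and it assumes the node name appears verbatim in the output of both commands, which may not hold if Consul and Patroni use different names for the same host.

```shell
#!/bin/sh
# Hypothetical helper: reads a member listing on stdin and reports
# whether the given name occurs in it as a fixed string.
listed() {
  if grep -qF -- "$1"; then
    echo "listed"
  else
    echo "missing"
  fi
}

# Usage on a live node (commented out because it needs Omnibus GitLab;
# NODE is a placeholder for the name of the misbehaving node):
# NODE="my-patroni-node"
# echo "patroni: $(sudo gitlab-ctl patroni members | listed "$NODE")"
# echo "consul:  $(sudo /opt/gitlab/embedded/bin/consul members | listed "$NODE")"
```

A node in the state described above would be expected to show as missing for Patroni and listed for Consul.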
Proposal
There is probably a more manual procedure we could use here, but what I have found to always work is enabling the pg_rewind flag in /etc/gitlab/gitlab.rb:
patroni['use_pg_rewind'] = true
This raises the question of whether we should enable it by default, or whether doing so carries significant risks.
If we think it has risks for the primary node, we could consider enabling it only on a Geo secondary (Standby Leader).
References
Documentation: https://patroni.readthedocs.io/en/latest/SETTINGS.html?highlight=rewind#postgresql