pg-upgrade-logical: Fix the error "new timeline XX forked off current database system timeline XX before current recovery point"
Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24150
Currently, when the Patroni cluster maintenance mode (pause) is enabled and the patroni service is stopped on the standby leader first, Patroni can change the primary_conninfo
parameter on the target cluster replicas, repointing them at the source cluster leader. This is not the behavior we want during the upgrade procedure: it can leave the replicas ahead of the target cluster leader in LSN position, and as a result physical replication on the target cluster stops working.
- Example of the error:
new timeline 20 forked off current database system timeline 19 before current recovery point 9EAAC/F4E41780
How to reproduce:
- Run this command on a target cluster replica to monitor the primary_conninfo value
for i in {1..3600}; do echo $(date) && gitlab-psql -tAXc "show primary_conninfo" ; sleep 1; done
- Enable maintenance mode for the patroni cluster
patronictl pause target-postgres-cluster
- Stop the patroni service on the Target cluster leader
systemctl stop patroni
- Check the value of the primary_conninfo parameter in the output of the command from step 1
Example:
Mon Jul 31 06:42:55 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:56 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:57 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:58 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:59 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:00 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:02 PM UTC 2023
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Is the server running locally and accepting connections on that socket?
Mon Jul 31 06:43:03 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:04 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:05 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:06 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:07 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
- 10.1.0.4 - Target cluster leader
- 10.1.0.2 - Source cluster leader
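The fix is to reverse the stop order: replicas first, standby leader last, so Patroni never repoints a still-running replica. A sketch of that sequence under assumed hostnames (target-pgnode01..03 are hypothetical); RUN=echo makes it a dry run that only prints the commands:

```shell
# Stop order that avoids the repointing problem: pause the cluster, stop the
# patroni service on the replicas first, and only then on the standby leader.
RUN=${RUN:-echo}                      # dry run by default; unset to execute
REPLICAS="target-pgnode02 target-pgnode03"
LEADER="target-pgnode01"

$RUN patronictl pause target-postgres-cluster
for host in $REPLICAS; do             # replicas first...
  $RUN ssh "$host" systemctl stop patroni
done
$RUN ssh "$LEADER" systemctl stop patroni   # ...standby leader last
```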
What's new
- after pausing the Patroni cluster, stop the patroni service first on the replica servers and only then on the standby leader server
- do not perform WAL archiving while waiting to reach recovery_target_lsn
- pause WAL replay on the Standby Cluster Replicas before stopping PostgreSQL on the Standby Cluster Leader
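Pausing replay uses PostgreSQL's built-in pg_wal_replay_pause() function on each replica. A sketch of that step, assuming hypothetical hostnames; PSQL defaults to a dry-run echo here, whereas on real hosts you would invoke gitlab-psql directly:

```shell
# Pause WAL replay on every standby cluster replica before stopping
# PostgreSQL on the standby cluster leader, so no replica replays past
# the point where the leader stops.
PSQL=${PSQL:-echo gitlab-psql}        # dry run by default
for host in target-pgnode02 target-pgnode03; do
  $PSQL -h "$host" -tAXc "SELECT pg_wal_replay_pause();"
done
# After the upgrade step completes, resume with pg_wal_replay_resume().
```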
Additionally
- increase the timeout to 10 min for the task "Wait until the Patroni cluster is stopped"
- Update task "Wait until the PostgreSQL replica is synchronized"
- The query from the "Verify table record on Replicas" task is now part of this task, replacing the pg_lag_bytes check from pg_stat_replication. This approach is more reliable because it eliminates the risk that a window with 0 bytes of lag never appears (for example, due to autovacuum activity), which could cause the task to fail with a timeout.
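The record-based check above amounts to: write a marker row on the leader, then poll the replica until the row shows up. A minimal sketch of the polling half; wait_for_marker and the upgrade_marker table are hypothetical names, not the task's actual implementation:

```shell
# Retry CMD once per second until its output equals MARKER or TIMEOUT
# seconds elapse. Returns 0 when the marker is seen, 1 on timeout.
wait_for_marker() {
  local cmd=$1 marker=$2 timeout=$3 i=0
  while [ "$i" -lt "$timeout" ]; do
    [ "$(eval "$cmd")" = "$marker" ] && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1   # timed out without seeing the marker
}

# Usage on a replica (commented out; needs a live cluster and marker table):
# wait_for_marker 'gitlab-psql -tAXc "SELECT id FROM upgrade_marker WHERE id = 42"' 42 600
```

Unlike a 0-byte-lag check, this succeeds as soon as the specific record is replayed, regardless of unrelated WAL traffic keeping the lag counter above zero.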
Edited by Vitaliy Kukharik