Skip to content

pg-upgrade-logical: Fix the error "new timeline XX forked off current database system timeline XX before current recovery point"

Vitaliy Kukharik requested to merge pause-stop-patroni into master

Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24150

At the moment, when enable the Patroni cluster maintenance mode (pause) and when stopping the patroni service first on standby leader, the patroni service can change the primary_conninfo parameter to target cluster replicas, which is not the preferred behavior for us in the upgrade procedure, since this can lead to a situation where replicas are ahead (LSN position) than the leader of the cluster target, as a result of which we may get an error due to the fact that physical replication on the target cluster does not work.

  • example of error: "new timeline 20 forked off current database system timeline 19 before current recovery point 9EAAC/F4E41780"

Example how to reproduce:

  1. run the command to monitor the primary_conninfo value on the target cluster replica
for i in {1..3600}; do echo $(date) && gitlab-psql -tAXc "show primary_conninfo" ; sleep 1; done
  1. Enable maintenance mode for the patroni cluster
patronictl pause target-postgres-cluster
  1. Stop patroni service on the Target cluster leader
stop systemctl stop patroni
  1. Check the value of the primary_conninfo parameter as a result of executing the command from step 1

Example:

Mon Jul 31 06:42:55 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:56 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:57 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:58 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:59 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:00 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:02 PM UTC 2023
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
	Is the server running locally and accepting connections on that socket?
Mon Jul 31 06:43:03 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:04 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:05 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:06 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:07 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
  • 10.1.0.4 - Target cluster leader
  • 10.1.0.2 - Source Cluster leader

What's new

  • after pause the patroni cluster, stop the patroni service first on the replica servers and only then on the standby leader server
  • do not perform WAL archiving during reach recovery_target_lsn
  • pause WAL replay on the Standby Cluster Replicas before stopping PostgreSQL on the Standby Cluster Leader

Additionally

  • increase the timeout to 10 min for the task "Wait until the Patroni cluster is stopped"
  • Update task "Wait until the PostgreSQL replica is synchronized"
    • The query from the "Verify table record on Replicas" task became part of this task instead of getting pg_lag_bytes from pg_stat_replication. This option is more reliable because it eliminates the risk associated with the absence of a window with a 0 byte lag (for example, autovacuum activity), which could lead to a timeout task execution error.
Edited by Vitaliy Kukharik

Merge request reports