pg-upgrade-logical: Fix the error "new timeline XX forked off current database system timeline XX before current recovery point"
Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24150
Currently, when the Patroni cluster maintenance mode (pause) is enabled and the patroni service is stopped on the standby leader first, Patroni can change the primary_conninfo
parameter on the target cluster replicas, repointing them at the source cluster leader. This is not the behavior we want during the upgrade procedure: it can leave the replicas ahead of the target cluster leader in LSN position, and as a result physical replication on the target cluster stops working.
- Example of the error:
new timeline 20 forked off current database system timeline 19 before current recovery point 9EAAC/F4E41780
How to reproduce:
- Run this command on a target cluster replica to monitor the primary_conninfo value
for i in {1..3600}; do echo $(date) && gitlab-psql -tAXc "show primary_conninfo" ; sleep 1; done
- Enable maintenance mode for the patroni cluster
patronictl pause target-postgres-cluster
- Stop the patroni service on the Target cluster leader
systemctl stop patroni
- Check the value of the primary_conninfo parameter in the output of the command from step 1
Example:
Mon Jul 31 06:42:55 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:56 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:57 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:58 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:42:59 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:00 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.4 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:02 PM UTC 2023
psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
Is the server running locally and accepting connections on that socket?
Mon Jul 31 06:43:03 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:04 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:05 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:06 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
Mon Jul 31 06:43:07 PM UTC 2023
user=replicator passfile=/var/lib/postgresql/.pgpass_patroni host=10.1.0.2 port=5432 sslmode=prefer application_name=target-pgnode02 gssencmode=prefer
- 10.1.0.4 - Target cluster leader
- 10.1.0.2 - Source cluster leader
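The fix is to reverse the stop order: replicas first, standby leader last, so Patroni never repoints a still-running replica. A sketch of that sequence under assumed hostnames (target-pgnode01..03 are hypothetical); RUN=echo makes it a dry run that only prints the commands:

```shell
# Stop order that avoids the repointing problem: pause the cluster, stop the
# patroni service on the replicas first, and only then on the standby leader.
RUN=${RUN:-echo}                      # dry run by default; unset to execute
REPLICAS="target-pgnode02 target-pgnode03"
LEADER="target-pgnode01"

$RUN patronictl pause target-postgres-cluster
for host in $REPLICAS; do             # replicas first...
  $RUN ssh "$host" systemctl stop patroni
done
$RUN ssh "$LEADER" systemctl stop patroni   # ...standby leader last
```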
What's new
- after pausing the Patroni cluster, stop the patroni service first on the replica servers and only then on the standby leader server
- do not perform WAL archiving while waiting to reach recovery_target_lsn
- pause WAL replay on the Standby Cluster Replicas before stopping PostgreSQL on the Standby Cluster Leader
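Pausing replay uses PostgreSQL's built-in pg_wal_replay_pause() function on each replica. A sketch of that step, assuming hypothetical hostnames; PSQL defaults to a dry-run echo here, whereas on real hosts you would invoke gitlab-psql directly:

```shell
# Pause WAL replay on every standby cluster replica before stopping
# PostgreSQL on the standby cluster leader, so no replica replays past
# the point where the leader stops.
PSQL=${PSQL:-echo gitlab-psql}        # dry run by default
for host in target-pgnode02 target-pgnode03; do
  $PSQL -h "$host" -tAXc "SELECT pg_wal_replay_pause();"
done
# After the upgrade step completes, resume with pg_wal_replay_resume().
```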
Additionally
- increase the timeout to 10 min for the task "Wait until the Patroni cluster is stopped"
- Update task "Wait until the PostgreSQL replica is synchronized"
- The query from the "Verify table record on Replicas" task is now part of this task, replacing the pg_lag_bytes check from pg_stat_replication. This approach is more reliable because it eliminates the risk that a window with 0 bytes of lag never appears (for example, due to autovacuum activity), which could cause the task to fail with a timeout.
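The record-based check above amounts to: write a marker row on the leader, then poll the replica until the row shows up. A minimal sketch of the polling half; wait_for_marker and the upgrade_marker table are hypothetical names, not the task's actual implementation:

```shell
# Retry CMD once per second until its output equals MARKER or TIMEOUT
# seconds elapse. Returns 0 when the marker is seen, 1 on timeout.
wait_for_marker() {
  local cmd=$1 marker=$2 timeout=$3 i=0
  while [ "$i" -lt "$timeout" ]; do
    [ "$(eval "$cmd")" = "$marker" ] && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1   # timed out without seeing the marker
}

# Usage on a replica (commented out; needs a live cluster and marker table):
# wait_for_marker 'gitlab-psql -tAXc "SELECT id FROM upgrade_marker WHERE id = 42"' 42 600
```

Unlike a 0-byte-lag check, this succeeds as soon as the specific record is replayed, regardless of unrelated WAL traffic keeping the lag counter above zero.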
Edited by Vitaliy Kukharik