PGUpgrade script not working on Patroni standby leaders
Discovered in #6133 (comment 883762326)
Copied from that discussion:
At the moment, while trying to upgrade to PG13 in a Patroni cluster, the primary upgrades without issue, but the secondary fails with:

```
The source cluster was shut down while in recovery mode. To upgrade, use "rsync" as documented or shut it down as a primary
```

This is on the standby leader. In its logs we are seeing:

```
2022-03-23_11:12:31.39258 FATAL: database system identifier differs between the primary and standby
2022-03-23_11:12:31.39387 DETAIL: The primary's identifier is 7078249951701215667, the standby's identifier is 7077888904406040401.
```
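For reference, the mismatch can be confirmed directly from each node's control file. A minimal sketch, assuming the default Omnibus data directory and that the bundled `pg_controldata` matches the data directory's PostgreSQL version:

```shell
# Run on both the primary and the standby leader and compare the output;
# the identifier is assigned at initdb time and must match for replication.
sudo /opt/gitlab/embedded/bin/pg_controldata /var/opt/gitlab/postgresql/data \
  | grep 'Database system identifier'
```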
so the upgrade fails. Usually we can remove the cluster state with:

```shell
/opt/gitlab/embedded/bin/patronictl -c /var/opt/gitlab/patroni/patroni.yaml remove postgresql-ha
```

but the upgrade still fails, because the standby leader, still on PG12, cannot join the new cluster, which is already on PG13.
The problem seems to stem from the upgrade of the primary site: during the upgrade we create a new Patroni cluster with a new ID. I don't know if we did this before or not, as I don't recall having this issue with PG12.
```
Before: Cluster: postgresql-ha (7080081066190363607)
After:  Cluster: postgresql-ha (7080128436831724847)
```
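For what it's worth, this identifier is the one printed in the `patronictl list` header, so the change can be spotted directly on any node; a quick check, assuming the standard Omnibus paths:

```shell
# The header shows the cluster name and system identifier, e.g.
# + Cluster: postgresql-ha (7080081066190363607) ---+
sudo /opt/gitlab/embedded/bin/patronictl -c /var/opt/gitlab/patroni/patroni.yaml list
```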
Once the primary is upgraded, the secondary is unable to replicate due to the new identifier. We hit a similar issue during Geo setup and have to run the following (a verification sketch follows the list):
- `gitlab-ctl stop patroni`
- `rm -rf /var/opt/gitlab/postgresql/data/`
- `/opt/gitlab/embedded/bin/patronictl -c /var/opt/gitlab/patroni/patroni.yaml remove postgresql-ha`
- `gitlab-ctl reconfigure`
- `gitlab-ctl restart`
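If the reset succeeds, the replica should reappear in the cluster and start streaming from the leader. One way to verify, assuming the standard Omnibus paths and that this Omnibus version ships the `gitlab-ctl patroni members` subcommand:

```shell
# On the current leader: confirm the rebuilt node shows up as a streaming client.
sudo gitlab-psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"

# Cluster-wide view of members, roles, and lag.
sudo gitlab-ctl patroni members
```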
However, this doesn't work during the upgrade. After running the restart, the logs now show:
```
2022-03-28_12:46:42.32723 2022-03-28 12:46:42,326 INFO: trying to bootstrap a new standby leader
2022-03-28_12:46:42.33296 2022-03-28 12:46:42,331 INFO: Lock owner: None; I am geo-3k-west2-postgres-1.c.gitlab-qa-geo-986758.internal
2022-03-28_12:46:42.33308 2022-03-28 12:46:42,331 INFO: not healthy enough for leader race
2022-03-28_12:46:42.33433 2022-03-28 12:46:42,333 INFO: bootstrap_standby_leader in progress
2022-03-28_12:46:42.41497 pg_basebackup: error: incompatible server version 13.3
2022-03-28_12:46:42.41536 pg_basebackup: removing data directory "/var/opt/gitlab/postgresql/data"
2022-03-28_12:46:42.41675 2022-03-28 12:46:42,416 ERROR: Error when fetching backup: pg_basebackup exited with code=1
```
The standby's PG12 `pg_basebackup` refuses to back up from the PG13 primary and removes the data directory when it aborts, so running the upgrade now fails with:
```
---- Begin output of du -s --block-size=1m /var/opt/gitlab/postgresql/data ----
STDOUT:
STDERR: du: cannot access '/var/opt/gitlab/postgresql/data': No such file or directory
---- End output of du -s --block-size=1m /var/opt/gitlab/postgresql/data ----
```
From @twk3: Thanks for the details. I think we should copy this thread to a new issue for Distribution to tackle. What we need is a new codepath during pg-upgrade for the standby leader (a rough sketch of the equivalent manual steps follows the list). It needs to:
- Remove the ID from the secondary's Consul (this is something the leader codepath does).
- Blow away the data directory to allow the new sync (this is something the replica codepath does).
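Roughly, that codepath would combine the two existing behaviours. Here is a hypothetical sketch of the equivalent manual sequence on the standby leader; the ordering inside pg-upgrade is an assumption, not the implemented fix, and it presumes the PG13 binaries are already linked so `pg_basebackup` matches the primary's version:

```shell
# Stop Patroni so nothing re-bootstraps against the stale cluster state.
sudo gitlab-ctl stop patroni

# Remove the old cluster (and its system identifier) from the secondary's
# Consul, as the leader codepath does.
sudo /opt/gitlab/embedded/bin/patronictl -c /var/opt/gitlab/patroni/patroni.yaml remove postgresql-ha

# Blow away the old PG12 data directory, as the replica codepath does.
sudo rm -rf /var/opt/gitlab/postgresql/data

# Reconfigure and restart so Patroni bootstraps a fresh standby leader with
# pg_basebackup from the upgraded PG13 primary.
sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart
```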
From @pursultani: This is due to different `pg_controldata`, which leads to a new cluster ID. As a result, when we upgrade a Patroni cluster we are effectively creating a new one; the only difference is that the leader treats the existing database as a standalone database.
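The identifier Patroni guards against is the one it persisted in the DCS when the cluster was first bootstrapped. A small sketch for inspecting it, assuming Consul as the DCS with Patroni's default `service` namespace and the `postgresql-ha` scope:

```shell
# Patroni stores the cluster's system identifier under <namespace>/<scope>/initialize.
# After the primary-side upgrade, this value no longer matches the standby's
# control file, so the standby refuses to replicate.
/opt/gitlab/embedded/bin/consul kv get service/postgresql-ha/initialize
```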