Pausing replication fails promotion
Problem to solve
In the Pausing and Resuming replication docs we mention that this is useful during upgrades or planned failover, but what happens in a DR scenario when replication is paused? Are there any extra steps to be taken?
We should likely test and document what happens if a disaster occurs while replication is paused.
Steps to Reproduce
- Set up a Geo cluster with gitlab-orchestrator, use the Geo DB node for the following steps
- Pause replication with gitlab-ctl geo-replication-pause
- Update /etc/gitlab/gitlab.rb to remove roles ['geo_secondary_role']
- Promote the secondary DB with gitlab-ctl promote-to-primary-node --skip-preflight-check
Expected behavior
Server is promoted to primary server without issue
Actual behavior
Promoting the PostgreSQL to primary...
waiting for server to promote............................................................... stopped waiting
pg_ctl: server did not promote in time
Further details
Initial internal discussion in Slack
From https://repmgr.org/docs/4.3/repmgrd-wal-replay-pause.html
Note: repmgr standby promote will refuse to promote a node in this state, as the PostgreSQL promote command will not be acted on until WAL replay is resumed, leaving the cluster in a potentially unstable state. In this case it is up to the user to decide whether to resume WAL replay.
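So the promote request is simply held: PostgreSQL does not promote the secondary to a primary until WAL replay is resumed, which is why pg_ctl stops waiting and reports that the server did not promote in time.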
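A possible pre-promotion check, as a minimal sketch rather than a verified fix: ask PostgreSQL whether WAL replay is paused and, if so, resume replication before running the promote command. `pg_is_wal_replay_paused()` is a standard PostgreSQL function (PostgreSQL 10 or later). The database name `gitlabhq_production`, the helper names `wal_replay_paused` and `resume_if_paused`, and the `PSQL`, `GITLAB_DB` and `GITLAB_CTL` override variables are assumptions made for illustration. Whether resuming is safe or sufficient in a real DR scenario is the open question this issue asks to test.

```shell
#!/bin/sh
# Sketch only. Assumptions not taken from the issue: gitlab-psql is on PATH
# and the replicated database is named gitlabhq_production.

# Prints "t" when WAL replay is paused on this node, "f" otherwise.
# PSQL and GITLAB_DB are hypothetical override variables.
wal_replay_paused() {
  "${PSQL:-gitlab-psql}" -d "${GITLAB_DB:-gitlabhq_production}" -At \
    -c "SELECT pg_is_wal_replay_paused();"
}

# Resumes replication only when WAL replay is currently paused,
# so that a later promote request is not left waiting.
resume_if_paused() {
  if [ "$(wal_replay_paused)" = "t" ]; then
    echo "WAL replay is paused; resuming before promotion"
    "${GITLAB_CTL:-gitlab-ctl}" geo-replication-resume
  else
    echo "WAL replay is not paused"
  fi
}
```

On the Geo DB node, one would source these functions, call `resume_if_paused`, and then run the promote step from the reproduction list. The ordering relative to editing /etc/gitlab/gitlab.rb has not been verified here.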