Promotion fails while replication is paused

Problem to solve

The Pausing and Resuming replication docs mention that pausing is useful during upgrades or planned failover, but what happens in a DR scenario while replication is paused? Are any extra steps required?

We should test and document what happens when a DR scenario occurs while replication is paused.

Steps to Reproduce

  1. Set up a Geo cluster with gitlab-orchestrator and use the Geo DB node for the following steps
  2. Pause replication with gitlab-ctl geo-replication-pause
  3. Update /etc/gitlab/gitlab.rb to remove roles ['geo_secondary_role']
  4. Promote the secondary DB with gitlab-ctl promote-to-primary-node --skip-preflight-check

Expected behavior

The server is promoted to primary without issue.

Actual behavior

Promoting the PostgreSQL to primary...
waiting for server to promote............................................................... stopped waiting
pg_ctl: server did not promote in time

Further details

Initial internal discussion in Slack

From https://repmgr.org/docs/4.3/repmgrd-wal-replay-pause.html

Note: repmgr standby promote will refuse to promote a node in this state, as the PostgreSQL promote command will not be acted on until WAL replay is resumed, leaving the cluster in a potentially unstable state. In this case it is up to the user to decide whether to resume WAL replay.

So the promotion is not rejected outright: PostgreSQL holds the promote request until WAL replay is resumed, and pg_ctl stops waiting in the meantime.
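
A possible pre-promotion check, as a minimal sketch rather than a tested fix: ask PostgreSQL whether WAL replay is paused and resume it if so. It assumes PostgreSQL 10 or later (where the functions are named pg_is_wal_replay_paused() and pg_wal_replay_resume()), that the pause from step 2 shows up as paused WAL replay, and that the Omnibus gitlab-psql wrapper is available on the Geo DB node. The PSQL variable and both function names are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Sketch only: PSQL is a hypothetical override hook; by default it uses
# the Omnibus gitlab-psql wrapper (run as root on the Geo DB node).
PSQL="${PSQL:-gitlab-psql}"

# Prints "t" if WAL replay is paused, "f" otherwise.
# Assumes PostgreSQL 10+ function naming.
wal_replay_paused() {
  $PSQL -At -c 'SELECT pg_is_wal_replay_paused();'
}

# Resumes WAL replay only when it is currently paused.
# Returns non-zero if the state could not be queried.
ensure_wal_replay_resumed() {
  state=$(wal_replay_paused) || return 1
  if [ "$state" = "t" ]; then
    echo "WAL replay is paused; resuming before promotion"
    $PSQL -At -c 'SELECT pg_wal_replay_resume();' >/dev/null
  else
    echo "WAL replay is not paused"
  fi
}
```

Calling ensure_wal_replay_resumed before step 4 would be one way to confirm the diagnosis above. Whether resuming replay is safe in a real DR scenario, or whether gitlab-ctl geo-replication-resume should be used instead, is exactly what still needs to be tested and documented.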

Edited by Catalin Irimie