Pausing replication fails promotion

Problem to solve

The Pausing and Resuming replication docs mention that pausing is useful during upgrades or planned failover, but they do not say what happens if a disaster recovery (DR) scenario occurs while replication is paused. Are any extra steps needed before promoting the secondary?

We should test and document what happens if a DR scenario occurs while replication is paused.

Steps to Reproduce

  1. Set up a Geo cluster with gitlab-orchestrator, and run the following steps on the Geo DB node
  2. Pause replication with gitlab-ctl geo-replication-pause
  3. Update /etc/gitlab/gitlab.rb to remove roles ['geo_secondary_role']
  4. Promote the secondary DB with gitlab-ctl promote-to-primary-node --skip-preflight-check
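
The command steps above can be sketched as a small script. This is only an illustration: the run and reproduce helpers and the DRY_RUN switch are hypothetical names introduced here, not part of gitlab-ctl. DRY_RUN defaults to 1, so the commands are printed rather than executed, and step 3 stays a manual edit.

```shell
#!/bin/sh
# Reproduction sketch; intended for the Geo DB node, run as root.
# DRY_RUN is an illustrative switch (not a gitlab-ctl feature):
# 1 (default) prints each command, 0 executes it.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    # Step 2: pause replication
    run gitlab-ctl geo-replication-pause
    # Step 3 is left manual: edit the config file by hand
    echo "# manual: remove roles ['geo_secondary_role'] from /etc/gitlab/gitlab.rb"
    # Step 4: promote the secondary DB, skipping preflight checks
    run gitlab-ctl promote-to-primary-node --skip-preflight-check
}
```

Calling reproduce with DRY_RUN unset only prints the two gitlab-ctl commands and the manual reminder; setting DRY_RUN=0 first would actually run them.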

Expected behavior

The server is promoted to primary without issue.

Actual behavior

Promoting the PostgreSQL to primary...
waiting for server to promote............................................................... stopped waiting
pg_ctl: server did not promote in time

Further details

Initial internal discussion in Slack

From https://repmgr.org/docs/4.3/repmgrd-wal-replay-pause.html

Note: repmgr standby promote will refuse to promote a node in this state, as the PostgreSQL promote command will not be acted on until WAL replay is resumed, leaving the cluster in a potentially unstable state. In this case it is up to the user to decide whether to resume WAL replay.

In other words, PostgreSQL does not act on the promote request until WAL replay is resumed, so the promotion of the secondary to a primary waits until it times out.
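
One possible guard, sketched below, is to check whether WAL replay is paused before promoting and to resume it first if so. The sketch assumes PostgreSQL 10 or later (where pg_is_wal_replay_paused() exists), that gitlab-psql passes the -t, -A and -c options through to psql, and that gitlab-ctl geo-replication-resume resumes WAL replay on this node. The replay_action and promote_checked helpers are hypothetical names. Whether resuming is the right choice in a DR scenario is exactly what this issue asks to test, so treat this as a starting point rather than a verified procedure.

```shell
#!/bin/sh
# Map the output of "SELECT pg_is_wal_replay_paused();" to an action.
# With psql -t -A the query prints "t" when replay is paused, "f" otherwise.
replay_action() {
    case "$1" in
        t) echo "resume-first" ;;
        f) echo "promote" ;;
        *) echo "unknown" ;;
    esac
}

# Hypothetical wrapper: resume WAL replay if it is paused, then promote.
promote_checked() {
    paused=$(gitlab-psql -t -A -c "SELECT pg_is_wal_replay_paused();")
    case "$(replay_action "$paused")" in
        resume-first)
            echo "WAL replay is paused; resuming before promotion" >&2
            gitlab-ctl geo-replication-resume || return 1
            ;;
        unknown)
            echo "could not determine WAL replay state: '$paused'" >&2
            return 1
            ;;
    esac
    # Same promote command as in the reproduction steps
    gitlab-ctl promote-to-primary-node --skip-preflight-check
}
```

The block only defines functions; nothing runs until promote_checked is called on the node being promoted.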

Edited by Catalin Irimie