Geo: Document how to handle a repmgr failover
I believe Geo replication will break when the primary site runs an HA PostgreSQL cluster managed by repmgr
and a PostgreSQL failover promotes another DB node. We should document this and recommend mitigation steps.
Situation
- Initial setup
graph TD
subgraph Geo deployment
subgraph Primary[Primary site, main repmgr cluster]
DB_1[DB main leader] --> DB_2[DB follower]
DB_1[DB main leader] --> DB_3[DB follower]
end
subgraph Secondary[Secondary site]
DB_4["DB Geo standby"]
end
DB_1 --> DB_4
end
- PG failover on primary site
A new node is promoted via repmgr; Geo replication breaks and manual intervention is needed.
graph TD
subgraph Geo deployment
subgraph Primary[Primary site, main repmgr cluster]
DB_2[DB new leader]
DB_2[DB new leader] --> DB_3[DB follower]
end
subgraph Secondary[Secondary site]
DB_4["DB Geo standby"]
end
end
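The two diagrams can be restated as a small model. This is only an illustrative sketch, not repmgr or Geo code: the `Node` class and the `promote` and `orphaned_standbys` helpers are hypothetical names, and it assumes (as the diagrams show) that the failover re-points the in-cluster follower but not the Geo standby.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    upstream: Optional[str]  # None means this node is the leader


def build_cluster() -> Dict[str, Node]:
    """Initial setup: DB_1 leads, DB_2/DB_3 follow, DB_4 is the Geo standby."""
    nodes = [
        Node("DB_1", None),
        Node("DB_2", "DB_1"),
        Node("DB_3", "DB_1"),
        Node("DB_4", "DB_1"),
    ]
    return {n.name: n for n in nodes}


def promote(cluster: Dict[str, Node], failed: str, new_leader: str,
            repointed: List[str]) -> None:
    """Model a failover: remove the failed leader, promote new_leader, and
    re-point only the followers listed in `repointed` (assumption: the
    Geo standby is not among them)."""
    del cluster[failed]
    cluster[new_leader].upstream = None
    for name in repointed:
        cluster[name].upstream = new_leader


def orphaned_standbys(cluster: Dict[str, Node]) -> List[str]:
    """Return the standbys whose upstream no longer exists in the cluster."""
    return sorted(
        n.name for n in cluster.values()
        if n.upstream is not None and n.upstream not in cluster
    )


cluster = build_cluster()
before = orphaned_standbys(cluster)
promote(cluster, failed="DB_1", new_leader="DB_2", repointed=["DB_3"])
after = orphaned_standbys(cluster)
print(before, after)
```

In this model the Geo standby is the only node left pointing at the old leader after the promotion, which is the state that would need manual intervention.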
Proposal
- Document the situation
- Add a section to the troubleshooting guide recommending mitigation steps