POC: Investigate pausing database replication
It is currently not known whether it is feasible to add pausing of database replication to our existing pause and resume logic.
Solving this would let us avoid the following situation:
- Upgrade primary
- Primary breaks
- Rollback primary
- Uh oh, data was lost somewhere
- Failover to DR secondary, but wait, data is already lost there as well
- Restore backup of primary (ouch)
Proposal
- Re-visit our current "pause and resume" logic
- Determine feasibility of @vsizov's idea of "pausing" replication on the secondary. The secondary will still receive WAL but won't apply it. In this case, the replication slot on the primary does not need to retain WAL files; only the secondary's disk is responsible for retaining them.
- Create POC versions of gitlab-rake geo:pause and gitlab-rake geo:resume that can be executed on the secondary
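The pause and resume steps proposed above could look roughly like the following. This is a minimal sketch in Python for illustration only, not the actual rake task implementation. The helper name `set_replay_paused` is hypothetical, and `cursor` is assumed to be any DB-API cursor connected to the secondary's PostgreSQL instance with sufficient privileges. The PostgreSQL functions used are the built-in recovery control functions (PostgreSQL 10 or later naming).

```python
"""Sketch: pause or resume WAL replay on a PostgreSQL standby."""

PAUSE_SQL = "SELECT pg_wal_replay_pause()"
RESUME_SQL = "SELECT pg_wal_replay_resume()"
IS_PAUSED_SQL = "SELECT pg_is_wal_replay_paused()"
IN_RECOVERY_SQL = "SELECT pg_is_in_recovery()"


def _scalar(cursor, sql):
    """Run a single-value query and return that value."""
    cursor.execute(sql)
    return cursor.fetchone()[0]


def set_replay_paused(cursor, paused):
    """Pause (True) or resume (False) WAL replay.

    Hypothetical helper. Returns the paused state reported by
    PostgreSQL after the request was made.
    """
    # Replay can only be paused on a server that is in recovery,
    # i.e. on the secondary, never on the primary.
    if not _scalar(cursor, IN_RECOVERY_SQL):
        raise RuntimeError("server is not in recovery; run this on the secondary")

    # The secondary keeps receiving WAL while paused; it just stops applying it.
    cursor.execute(PAUSE_SQL if paused else RESUME_SQL)
    return bool(_scalar(cursor, IS_PAUSED_SQL))
```

A geo:pause task would correspond to `set_replay_paused(cursor, True)` and geo:resume to `set_replay_paused(cursor, False)`. Whether this interacts cleanly with the existing pause and resume logic is exactly what the POC needs to find out.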
Definition of done for POC
- What is the weight of a command like gitlab-rake geo:pause? (Plus its inverse, gitlab-rake geo:resume)
  - These may use pg_wal_replay_pause() and pg_wal_replay_resume()
- And does it "just work", or are there problems attached that make it not worth it?
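One problem worth measuring during the POC is how much unapplied WAL accumulates on the secondary's disk while replay is paused. The sketch below is illustrative only, with hypothetical helper names. It assumes the two inputs are the text values returned by `pg_last_wal_receive_lsn()` and `pg_last_wal_replay_lsn()` on the secondary, and converts them to byte positions using PostgreSQL's `high/low` hexadecimal LSN format.

```python
def lsn_to_int(lsn):
    """Convert a textual LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high 32 bits
    and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_backlog_bytes(receive_lsn, replay_lsn):
    """Bytes of WAL received by the secondary but not yet applied."""
    return lsn_to_int(receive_lsn) - lsn_to_int(replay_lsn)
```

Sampling this value while replication is paused gives a rough idea of how quickly the secondary's disk fills up. The same difference can also be computed server-side with `pg_wal_lsn_diff()`.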
Edited by Fabian Zimmer