POC: Investigate pausing database replication

It is not yet known whether it is feasible to add pausing of database replication to our existing pause and resume logic.

Pausing database replication could help us avoid the following situation:

  • Upgrade primary
  • Primary breaks
  • Rollback primary
  • Uh oh, data was lost somewhere
  • Failover to DR secondary, but wait, data is already lost there as well
  • Restore backup of primary (ouch)

Proposal

  1. Re-visit our current "pause and resume" logic
  2. Determine the feasibility of @vsizov's idea of "pausing" replication on the secondary: the secondary still receives WAL but does not apply it. In this case, the replication slot on the primary does not need to retain WAL files; only the secondary's disk is responsible for retaining them.
  3. Create POC versions of gitlab-rake geo:pause and gitlab-rake geo:resume that can be executed on the secondary
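
As a rough illustration of what the pause and resume steps might do at the database level, here is a minimal sketch in Python rather than the actual rake task code. It assumes `conn` is any DB-API 2.0 connection to the secondary's PostgreSQL (version 10 or later, where the `pg_wal_*` function names apply). The helper names `pause_replay`, `resume_replay`, and `_query_scalar` are hypothetical and not part of GitLab.

```python
# Hypothetical helpers; `conn` is assumed to be a DB-API 2.0 connection
# to the PostgreSQL instance on the secondary.

def _query_scalar(conn, sql):
    """Run a single-row query and return its first column (or None)."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        row = cur.fetchone()
        return row[0] if row else None
    finally:
        cur.close()


def _ensure_standby(conn):
    # The replay control functions only work while the server is in recovery.
    if not _query_scalar(conn, "SELECT pg_is_in_recovery()"):
        raise RuntimeError("server is not a standby; refusing to change WAL replay")


def pause_replay(conn):
    """Ask the standby to stop applying WAL; it keeps receiving it."""
    _ensure_standby(conn)
    _query_scalar(conn, "SELECT pg_wal_replay_pause()")
    # True once PostgreSQL reports replay as paused (or a pause as requested).
    return bool(_query_scalar(conn, "SELECT pg_is_wal_replay_paused()"))


def resume_replay(conn):
    """Ask the standby to continue applying the WAL it has received."""
    _ensure_standby(conn)
    _query_scalar(conn, "SELECT pg_wal_replay_resume()")
    return not bool(_query_scalar(conn, "SELECT pg_is_wal_replay_paused()"))
```

Refusing to run against a server that is not in recovery is a design choice of this sketch, meant to keep the commands from being run against the primary by mistake.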

Definition of done for POC

  • What is the weight of a command like gitlab-rake geo:pause? (Plus its inverse gitlab-rake geo:resume)

    • These may use pg_wal_replay_pause() and pg_wal_replay_resume()
  • Does it "just work", or are there problems attached that make it not worth pursuing?
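
One possible problem to evaluate is how much unapplied WAL piles up on the secondary's disk while replay is paused. The sketch below is one way to measure that during the POC, not an agreed part of it. It assumes `conn` is a DB-API 2.0 connection to the secondary's PostgreSQL (version 10 or later), and the function name `replay_backlog_bytes` is hypothetical.

```python
def replay_backlog_bytes(conn):
    """Bytes of WAL received by the standby but not yet replayed.

    Returns None when PostgreSQL reports no position, for example when
    no WAL has been received via streaming replication.
    """
    sql = (
        "SELECT pg_wal_lsn_diff("
        "pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())"
    )
    cur = conn.cursor()
    try:
        cur.execute(sql)
        row = cur.fetchone()
    finally:
        cur.close()
    if row is None or row[0] is None:
        return None
    # pg_wal_lsn_diff returns a numeric; convert it to a plain int.
    return int(row[0])
```

Sampling this value periodically while replay is paused would give a first estimate of how long a pause can last before the secondary's disk becomes a concern.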

Edited by Fabian Zimmer