
Geo: Support efficient restore of a secondary

What do we need to do to confidently answer "Yes, and here's how to do it" to each of the following common questions?

If I restore a backup of the secondary, will it avoid unnecessary replication?

If I restore a backup of the primary into a secondary environment, can I configure it as a Geo secondary and will it avoid unnecessary replication?

If I failover to a secondary, can I use the original primary as a secondary without resyncing everything? Or, if I immediately failback to the original primary, can I avoid resyncing everything on the original secondary?

If I rsync (perhaps for a more efficient backfill) the repositories to the secondary, will it avoid unnecessary replication?

There are multiple major components to consider:

  • PostgreSQL database
  • Repositories
  • Blobs

Why?

Many customers have multiple terabytes of data in their Geo primary. Even an extremely fast connection takes a long time to backfill a secondary, and consumes significant resources on both sites. And these customers typically already have backups somewhere.
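To make the backfill cost concrete, here is a back-of-the-envelope calculation. The numbers (10 TB of data, a fully saturated 1 Gbps link) are illustrative assumptions, not measurements; real-world throughput is usually lower:

```shell
# Rough time to backfill 10 TB over a saturated 1 Gbps link.
# Both figures are illustrative assumptions.
bytes=$((10 * 1000 * 1000 * 1000 * 1000))   # 10 TB
bits=$((bytes * 8))
seconds=$((bits / 1000000000))              # 1 Gbps
hours=$((seconds / 3600))
echo "${hours} hours"                       # roughly a full day, best case
```

Even under these optimistic assumptions the transfer takes close to a day, which is why restoring from an existing backup instead of backfilling is attractive.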

Regarding the rsync question: this sometimes comes up out of a sysadmin's feeling of helplessness when some repositories are not replicated properly, or verification persistently fails for them. We have continually improved on this front and solved corner cases, but in any case, if we solve the backup-restore question, we probably solve this one too.

Proposal

MVP:

  • Confirm there is no security issue with excess repos on disk
  • Open an issue to avoid re-replicating existing blobs based on checksums => #352530 (closed)
  • [-] Update the failover docs to say it is possible to reuse a primary as a secondary => We intend to do #352530 (closed) already
    • Include a caveat: blob types like uploads and artifacts (add link to replicated data types) may be re-replicated (add link to issue to work on avoiding this inefficiency)
  • Add small section to Geo Troubleshooting doc: Mention you can rsync repos, and Geo will pick them up. => Copied this checkbox to #352530 (closed)
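The checksum-based skip proposed in #352530 could work roughly like this sketch. Everything here is illustrative (the paths, the variable names, and the idea that the primary's checksum is available locally); it is not GitLab's actual implementation:

```shell
# Sketch: decide whether to re-replicate a blob by comparing checksums.
# Paths and variable names are illustrative, not GitLab internals.
workdir=$(mktemp -d)
printf 'artifact bytes\n' > "$workdir/blob"

# Pretend this value was recorded on the primary at verification time.
primary_checksum=$(sha256sum "$workdir/blob" | awk '{print $1}')

# On the secondary: hash the local file and compare before downloading.
local_checksum=$(sha256sum "$workdir/blob" | awk '{print $1}')
if [ "$local_checksum" = "$primary_checksum" ]; then
  decision="skip"     # identical bytes already on disk; no transfer needed
else
  decision="download" # bytes differ or file missing; re-replicate
fi
echo "decision: $decision"
```

With a check like this in place, a secondary restored from backup (or seeded via rsync) would only transfer blobs whose content actually differs from the primary.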

And then:

  • [-] Open epic to contain the items below
  • Open issue to: Add a how to doc for creating a secondary from a backup of the primary or another secondary => #352533
    • HA first: Attempt it with primary that has a separate DB node
    • Include a caveat: Your instance may have data where there should be none, which wastes space.
    • Include a caveat: blob types like uploads and artifacts (add link to replicated data types) may be re-replicated (add link to issue to work on avoiding this inefficiency)
    • Weight: 3
  • Open issue to: Add a how to doc for reusing a primary as a secondary after failover => Added as a TODO in #352530 (closed). It's way easier to test this now with GET!
    • HA first: Attempt it with a primary that has a separate DB node
    • Weight: 3
  • Open issue to: Add relevant QA tests, to support this confidently across versions. The "efficiency" aspect is probably not worth testing, but we could at least test that a restored secondary isn't broken and becomes up to date. => I added "Add tests in shared examples for verified SSF data types" as part of the proposal of #352530 (closed). I think that's sufficient.
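For the restore-from-backup how-to in #352533, the skeleton might look like the following. This is a hedged sketch using standard Omnibus GitLab commands; the backup timestamp is a placeholder, and the Geo-specific steps (when to reconfigure as a secondary, how to avoid triggering a full resync) are deliberately left open, since working those out is the point of the issue:

```shell
# On the would-be secondary host (single-node Omnibus assumed):
sudo gitlab-ctl stop puma
sudo gitlab-ctl stop sidekiq

# Restore the backup taken from the primary (or another secondary).
sudo gitlab-backup restore BACKUP=<timestamp>

# Geo-specific configuration as a secondary would go here (TBD).
sudo gitlab-ctl reconfigure
sudo gitlab-ctl restart
sudo gitlab-rake gitlab:check SANITIZE=true
```

The HA case called out above (a primary with a separate DB node) would need additional steps around the database restore, which is why the issue proposes attempting that first.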
Edited by Michael Kozono