Geo: Support efficient restore of a secondary
What do we need to do to confidently answer "Yes, and here's how to do it" to the following common questions?
- If I restore a backup of the secondary, will it avoid unnecessary replication?
- If I restore a backup of the primary into a secondary environment, can I configure it as a Geo secondary, and will it avoid unnecessary replication?
- If I fail over to a secondary, can I use the original primary as a secondary without resyncing everything? Or, if I immediately fail back to the original primary, can I avoid resyncing everything on the original secondary?
- If I rsync the repositories to the secondary (perhaps for a more efficient backfill), will it avoid unnecessary replication?
There are multiple major components to consider:
- PostgreSQL database
- Repositories
- Blobs
Why?
Many customers have multiple terabytes of data in their Geo primary. Even over an extremely fast connection, backfilling a secondary takes a long time and uses up a lot of resources. These customers also already have backups somewhere.
Regarding the rsync question, this sometimes comes up because a sysadmin feels helpless when repos are not replicated properly, or when verification persistently fails for some repos. I think we have steadily improved on this front and solved corner cases, but in any case, if we solve the backup-restore question, we probably solve this one too.
Proposal
MVP:
- Confirm there is no security issue with excess repos on disk
- Open an issue to avoid rereplicating existing Blobs based on checksums => #352530 (closed)
- [-] Update the failover docs to say it is possible to reuse a primary as a secondary => We intend to do #352530 (closed) already
  - Include a caveat: Blob types like uploads and artifacts (add link to replicated data types) may be rereplicated (add link to issue to work on avoiding this inefficiency)
- Add small section to Geo Troubleshooting doc: Mention you can rsync repos, and Geo will pick them up. => Copied this checkbox to #352530 (closed)
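To illustrate the checksum-based idea in the list above, here is a minimal sketch of how a secondary could decide which blobs still need to be transferred after a restore or an rsync. It is not Geo's implementation: `blobs_to_replicate`, `sha256_of`, and the `primary_checksums` mapping are hypothetical names, and the choice of SHA-256 over file contents is an assumption made for the example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def blobs_to_replicate(primary_checksums: Dict[str, str], local_root: Path) -> List[str]:
    """Return relative paths that are missing locally or whose checksum differs.

    primary_checksums is a hypothetical mapping of relative path -> SHA-256
    as reported by the primary; it is not a real Geo data structure.
    """
    pending = []
    for relative_path, expected in sorted(primary_checksums.items()):
        local_file = local_root / relative_path
        # Transfer only when the restored copy is absent or does not match.
        if not local_file.is_file() or sha256_of(local_file) != expected:
            pending.append(relative_path)
    return pending
```

Files that exist locally but are unknown to the primary are simply ignored by this sketch, which is the "excess data on disk" situation the caveats in this proposal refer to.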
And then:
- [-] Open epic to contain the items below
- Open issue to: Add a how-to doc for creating a secondary from a backup of the primary or another secondary => #352533
  - HA first: Attempt it with a primary that has a separate DB node
  - Include a caveat: Your instance may have data where there should be none, which wastes space.
  - Include a caveat: Blob types like uploads and artifacts (add link to replicated data types) may be rereplicated (add link to issue to work on avoiding this inefficiency)
  - Weight: 3
- Open issue to: Add a how-to doc for reusing a primary as a secondary after failover => Added as a TODO in #352530 (closed). It's way easier to test this now with GET!
  - HA first: Attempt it with a primary that has a separate DB node
  - Weight: 3
- Open issue to: Add relevant QA tests, to support this confidently across versions. I don't think it's worth testing the "efficiency" aspect, but we could at least test that a restored secondary isn't broken and becomes up-to-date. => I added "Add tests in shared examples for verified SSF data types" as part of the proposal of #352530 (closed). I think that's sufficient.