gitlab-ctl replicate-geo-database fails for large database (again)

Per gitlab-com/migration#161 (closed)

We've seen replicate-geo-database succeed for GitLab.com's database in the past, but in this particular instance it fails after several hours with an error:

pg_basebackup: could not read COPY data: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

The only differences we're aware of are:

Presence of an ipsec tunnel
Slightly increased latency between the two sites
The GitLab.com database is a bit larger than it was last week, or the week before that

How do we make the initial replication process more robust? Ideally, it would survive a lost TCP connection without having to redo the entire process from the beginning, which is expensive for large databases and may mean that we can never replicate larger ones using this script.

Separately, I noticed that we're not using -X stream --slot <name> in the pg_basebackup invocation, which means that after pg_basebackup has finished restoring the data, replication may fail because the master has already expired the xlog it needs.

This isn't the cause of the above error, and since we create a slot separately, I think it's all OK, but perhaps there's an improvement to be made here too.
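
For illustration, here is a minimal sketch of a pg_basebackup invocation with -X stream --slot <name> added. It only builds and prints the command line. It is not what replicate-geo-database actually runs: the host, user, data directory, and slot name are hypothetical placeholders, and the function name is made up for this example. It assumes the slot has already been created on the master (as we do separately today) and that the installed pg_basebackup is new enough to support --slot.

```python
import shlex


def build_basebackup_cmd(host, user, data_dir, slot_name):
    """Return an argv list for pg_basebackup that streams WAL via a replication slot.

    All argument values are supplied by the caller; nothing here is taken
    from the real replicate-geo-database script.
    """
    return [
        "pg_basebackup",
        "-h", host,           # master to copy from
        "-U", user,           # replication user
        "-D", data_dir,       # target data directory on the secondary
        "-X", "stream",       # stream xlog/WAL alongside the base backup
        "--slot", slot_name,  # existing slot, so the master retains the needed xlog
        "-P",                 # show progress
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    cmd = build_basebackup_cmd(
        host="primary.example.com",
        user="gitlab_replicator",
        data_dir="/tmp/example-pgdata",
        slot_name="example_slot",
    )
    print(shlex.join(cmd))
```

Note that this addresses only the xlog-expiry concern; pg_basebackup still cannot resume a copy after the connection drops, so it does not help with the error above.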

/cc @_stark @brodock @stanhu @ibaum

Edited Sep 02, 2020 by 🤖 GitLab Bot 🤖