gitlab-ctl replicate-geo-database fails for large database (again)
Per gitlab-com/migration#161 (closed)
We've seen replicate-geo-database succeed for GitLab.com's database in the past, but in this particular instance it fails after several hours with an error:
pg_basebackup: could not read COPY data: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The only differences we're aware of are:
- Presence of an ipsec tunnel
- Slightly increased latency between the two sites
- The GitLab.com database is a bit larger than it was last week, or the week before that
How do we make the initial replication process more robust? Ideally, it would be able to survive the TCP connection being lost without having to restart the entire process from the beginning (which is expensive for large databases, and may mean that we can never replicate larger ones using this script).
Separately, I noticed that we're not using -X stream --slot <name> in the pg_basebackup invocation, which means that after pg_basebackup has finished copying the data, replication may fail to start because the master has already expired the xlog it needs.
This isn't the cause of the above error, and since we create a slot separately, I think it's all OK, but perhaps there's an improvement to be made here too.
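For reference, here is a minimal sketch of what a pg_basebackup invocation with those flags could look like. It is not what replicate-geo-database currently runs. The host, user, slot name and data directory are hypothetical placeholders, and the helper only prints the command so it can be reviewed before anything touches the database. Note that --slot requires PostgreSQL 9.6 or later and, on those older versions, a slot that already exists on the master.

```shell
#!/bin/sh
# Sketch only: build a pg_basebackup command line that streams WAL
# through an existing replication slot. All argument values used below
# are hypothetical placeholders, not values taken from the script.

basebackup_cmd() {
  primary_host=$1   # master to copy from
  repl_user=$2      # role with the REPLICATION privilege
  slot_name=$3      # replication slot created beforehand on the master
  data_dir=$4       # empty target data directory on the secondary

  # -X stream : fetch WAL over a second connection while the copy runs
  # --slot    : read that WAL through the named slot, so the master
  #             keeps the segments the backup still needs
  # -P -v     : progress reporting and verbose output
  printf 'pg_basebackup -h %s -U %s -D %s -X stream --slot %s -P -v\n' \
    "$primary_host" "$repl_user" "$data_dir" "$slot_name"
}

# Print the command for review instead of executing it.
basebackup_cmd primary.example.com gitlab_replicator geo_secondary \
  /var/opt/gitlab/postgresql/data
```

This only addresses xlog retention during and after the copy. It does not make pg_basebackup resumable, so a dropped TCP connection would still mean starting the copy again.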