Postgres replication failing from the GCP secondary
Summary
We are initiating geo replication to postgres-01.db.gprd.gitlab.com from postgres-04.db.prd.gitlab.com, a secondary of the primary gitlab.com database.
We have attempted to initiate replication a few times, and in all cases it has failed. The most recent attempt reached around 30%, hours after the replication process was started, before failing.
Troubleshooting so far
- I have tried to pull logs from the IPsec tunnel, the virtual network gateway on Azure, and the VPN connection on Google. These give only a very limited view of what is going on; everything looks fine on both sides.
Tunnel monitoring for the last two attempts: https://console.cloud.google.com/interconnect/vpn/tunnels/details/us-east1/gcp-azure-gprd?project=gitlab-production&tab=monitoring&duration=P1D
- Logs for postgres-04 (the gitlab.com secondary): https://log.gitlap.com/goto/2e01b81d05ebe0216a82a453fa6dd1b3
- Output of the last failed attempt:
*** The primary geo node is `10.66.1.104` ***
*** Are you sure you want to continue (replicate/no)? ***
Confirmation: replicate
* Stopping PostgreSQL and all GitLab services
Enter the password for gitlab_repmgr@10.66.1.104:
* Checking for replication slot secondary_gprd
* Backing up postgresql.conf
* Moving old data directory to '/var/opt/gitlab/postgresql/data.1517535046'
* Starting base backup as the replicator user (gitlab_repmgr)
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
transaction log start point: 42ED/642D2858 on timeline 4
pg_basebackup: could not read COPY data: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
transaction log start point: 42ED/642D2858 on timeline 4
pg_basebackup: could not read COPY data: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
*** Initial replication failed! ***
Replication tool returned with a non zero exit status!
Troubleshooting tips:
- replication should be run by root user
- check your trust settings `md5_auth_cidr_addresses` in `gitlab.rb` on the primary node
Failed to execute: PGPASSFILE=/var/opt/gitlab/postgresql/.pgpass /opt/gitlab/embedded/bin/pg_basebackup -h 10.66.1.104 -p 5432 -D /var/opt/gitlab/postgresql/data -U gitlab_repmgr -v -x -P
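One possible next step, not yet tried in this issue, is to re-run the failed `pg_basebackup` command with libpq TCP keepalives enabled, in case an idle-connection timeout somewhere on the VPN path is dropping the long-running COPY stream. That cause is only a hypothesis. The sketch below builds the command from the values in the output above, replacing `-h`/`-p`/`-U` with a `-d` connection string. The helper name `build_basebackup_argv` and the keepalive values (60/10/6) are assumptions for illustration, not tuned settings. It prints the command for review and does not run it.

```python
import shlex

# Values copied from the "Failed to execute" line in the output above.
PG_BASEBACKUP = "/opt/gitlab/embedded/bin/pg_basebackup"
DATA_DIR = "/var/opt/gitlab/postgresql/data"
PGPASSFILE = "/var/opt/gitlab/postgresql/.pgpass"


def build_basebackup_argv(host, port, user, idle=60, interval=10, count=6):
    """Return a pg_basebackup argv with libpq TCP keepalives enabled.

    Hypothetical helper: the keepalive defaults are illustrative assumptions.
    """
    connstr = (
        f"host={host} port={port} user={user} "
        f"keepalives=1 keepalives_idle={idle} "
        f"keepalives_interval={interval} keepalives_count={count}"
    )
    # -d takes a libpq connection string; -D, -v, -x, -P match the original command.
    return [PG_BASEBACKUP, "-d", connstr, "-D", DATA_DIR, "-v", "-x", "-P"]


argv = build_basebackup_argv("10.66.1.104", 5432, "gitlab_repmgr")

# Print the command for review instead of executing it.
print(f"PGPASSFILE={PGPASSFILE} " + " ".join(shlex.quote(a) for a in argv))
```

If the transfer still fails at a similar point with keepalives on, that would point away from an idle timeout and toward the server side or the tunnel itself.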
Edited by John Jarvis