Timeout during `pg_basebackup: initiating base backup, waiting for checkpoint to complete`
Problem
Intermittent timeout during Geo setup.
https://gitlab.com/gitlab-org/geo-team/geo-ci/-/jobs/6283577396:
TASK [gitlab_geo : Secondary Database - Replicate geo database] ****************
fatal: [ci-1k-eu-west2-gitlab-rails-1]: FAILED! => changed=true
cmd: |-
gitlab-ctl replicate-geo-database \
--slot-name=ci_1k_eu_west2_gitlab_rails_1 \
--host=10.168.0.122 \
--sslmode=verify-ca \
--force \
--skip-backup
delta: '0:05:30.440482'
end: '2024-02-29 02:07:54.372437'
msg: command exceeded timeout
rc: null
start: '2024-02-29 02:02:23.931955'
stdout: |-
No user created projects. Database not active
---------------------------------------------------------------
WARNING: Make sure this script is run from the secondary server
---------------------------------------------------------------
*** You are about to delete your local PostgreSQL database, and replicate the primary database. ***
*** The primary geo node is `10.168.0.122` ***
*** Are you sure you want to continue (replicate/no)? ***
Confirmation: Enter the password for gitlab_replicator@10.168.0.122:
* Stopping PostgreSQL and all GitLab services
* Checking for replication slot ci_1k_eu_west2_gitlab_rails_1
* Creating replication slot ci_1k_eu_west2_gitlab_rails_1
* Backing up postgresql.conf
* Moving old data directory to '/var/opt/gitlab/postgresql/data.1709172195'
* Starting base backup as the replicator user (gitlab_replicator)
pg_basebackup: initiating base backup, waiting for checkpoint to complete
Additional details
omnibus-gitlab!7452 (comment 1796024909):
`pg_basebackup: initiating base backup, waiting for checkpoint to complete`
https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-CHECKPOINT-TIMEOUT:
Maximum time between automatic WAL checkpoints... The default is five minutes (5min).
This step alone can take up to 5 minutes if nothing is happening on the primary PG DB.
A search for `checkpoint_timeout` in the Omnibus GitLab codebase:
If I understand correctly, without Patroni, we set checkpoint timeout to 5 minutes, and with Patroni, 30s (seems aggressive to use the minimum, no?).
Postgres checkpoints and how to tune them
I can't personally justify reducing Omnibus GitLab's PG checkpoint timeout just for the purpose of avoiding this Ansible timeout. If I understand correctly, doing so could cause a problem for some existing GitLab environments. Also I can imagine that adding a secondary Geo site to a large GitLab instance could take more than 5m30s to replicate PG data.
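To make the timing concrete, here is a rough Python sketch that uses only the `start` and `end` timestamps from the failed task output above and the 5-minute `checkpoint_timeout` default quoted from the PostgreSQL docs. It assumes the worst case described in the comment, where the checkpoint wait consumes the full 5 minutes; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps copied from the failed Ansible task output.
start = datetime.fromisoformat("2024-02-29 02:02:23.931955")
end = datetime.fromisoformat("2024-02-29 02:07:54.372437")

# How long the task ran before Ansible reported "command exceeded timeout".
elapsed = end - start

# Default checkpoint_timeout per the PostgreSQL docs (non-Patroni case).
checkpoint_timeout = timedelta(minutes=5)

# Assumption: worst case, the checkpoint wait uses the whole interval.
# Whatever is left has to cover every other step of the command.
remaining = elapsed - checkpoint_timeout

print(f"task ran for:         {elapsed}")
print(f"left after checkpoint: {remaining}")
```

Under that assumption, roughly 30 seconds remain for stopping services, creating the slot, and the base backup itself, which fits the intermittent failures.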
Possible solutions
@rmarshall @nwestbury @ibaum Does it sound ok to increase the Ansible timeout for that command from 5 minutes to say, 10 minutes?
(I opened gitlab-environment-toolkit!1268 (merged) proposing to bump it to 15m.)
@mkozono I agree with increasing the timeout in ansible, and I would also advocate for making it configurable.
Maybe in parallel we should consider setting `--checkpoint=fast` in the replication command? Or perhaps making it an option. Higher I/O cost, so maybe less desirable when adding a secondary to a live system. But for a new/empty system, it could save some time.
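As a rough illustration of the "make it an option" idea, the sketch below builds a `pg_basebackup` argument list with `--checkpoint=fast` as an opt-in. The function `build_basebackup_cmd` and its parameters are hypothetical, and this is not how `gitlab-ctl replicate-geo-database` actually assembles its command; only the `pg_basebackup` flags themselves are real. The data directory in the usage example is assumed from the path in the log above.

```python
def build_basebackup_cmd(host, data_dir, user, slot_name, fast_checkpoint=False):
    """Return a pg_basebackup argv list (hypothetical helper, not GitLab's code)."""
    cmd = [
        "pg_basebackup",
        f"--host={host}",
        f"--pgdata={data_dir}",
        f"--username={user}",
        "--wal-method=stream",
        f"--slot={slot_name}",
    ]
    if fast_checkpoint:
        # Ask the primary for an immediate checkpoint instead of a spread one.
        # Costs more I/O on the primary, so it stays opt-in.
        cmd.append("--checkpoint=fast")
    return cmd


# Values taken from the failed job above; the data directory is an assumption.
cmd = build_basebackup_cmd(
    host="10.168.0.122",
    data_dir="/var/opt/gitlab/postgresql/data",
    user="gitlab_replicator",
    slot_name="ci_1k_eu_west2_gitlab_rails_1",
    fast_checkpoint=True,
)
print(" ".join(cmd))
```

Defaulting to the spread checkpoint keeps the current behavior for live systems, while a new or empty environment such as CI could enable the fast path.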