Timeout during `pg_basebackup: initiating base backup, waiting for checkpoint to complete`

Problem

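Intermittent timeout during Geo setup: the Ansible task that runs `gitlab-ctl replicate-geo-database` on the secondary exceeds its timeout while `pg_basebackup` is still waiting for a checkpoint on the primary.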

https://gitlab.com/gitlab-org/geo-team/geo-ci/-/jobs/6283577396:

TASK [gitlab_geo : Secondary Database - Replicate geo database] ****************
fatal: [ci-1k-eu-west2-gitlab-rails-1]: FAILED! => changed=true 
  cmd: |-
    gitlab-ctl replicate-geo-database \
      --slot-name=ci_1k_eu_west2_gitlab_rails_1 \
      --host=10.168.0.122 \
      --sslmode=verify-ca \
      --force \
      --skip-backup
  delta: '0:05:30.440482'
  end: '2024-02-29 02:07:54.372437'
  msg: command exceeded timeout
  rc: null
  start: '2024-02-29 02:02:23.931955'
  stdout: |-
    No user created projects. Database not active
  
    ---------------------------------------------------------------
    WARNING: Make sure this script is run from the secondary server
    ---------------------------------------------------------------
  
    *** You are about to delete your local PostgreSQL database, and replicate the primary database. ***
    *** The primary geo node is `10.168.0.122` ***
  
    *** Are you sure you want to continue (replicate/no)? ***
    Confirmation: Enter the password for gitlab_replicator@10.168.0.122:
    * Stopping PostgreSQL and all GitLab services
    * Checking for replication slot ci_1k_eu_west2_gitlab_rails_1
    * Creating replication slot ci_1k_eu_west2_gitlab_rails_1
    * Backing up postgresql.conf
    * Moving old data directory to '/var/opt/gitlab/postgresql/data.1709172195'
    * Starting base backup as the replicator user (gitlab_replicator)
    pg_basebackup: initiating base backup, waiting for checkpoint to complete
    pg_basebackup: initiating base backup, waiting for checkpoint to complete

Additional details

omnibus-gitlab!7452 (comment 1796024909):

pg_basebackup: initiating base backup, waiting for checkpoint to complete

https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-CHECKPOINT-TIMEOUT

Maximum time between automatic WAL checkpoints...The default is five minutes (5min).

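With the default `checkpoint_timeout`, this step alone can take up to 5 minutes if nothing is happening on the primary PG DB, because `pg_basebackup` defaults to a spread checkpoint rather than an immediate one.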
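To confirm what the primary is actually using, something like the following could be run there. This is a minimal sketch: it assumes an Omnibus GitLab node where the `gitlab-psql` wrapper is available, the function name and the `PSQL` override are made up for illustration, and it also prints `checkpoint_completion_target`, which affects how long a spread checkpoint takes.

```shell
# Print the checkpoint settings that bound the "waiting for checkpoint" step.
# Assumption: run on the primary. PSQL defaults to Omnibus GitLab's gitlab-psql
# wrapper; override it (for example PSQL=psql) on other installations.
show_checkpoint_settings() {
  for setting in checkpoint_timeout checkpoint_completion_target; do
    # -A -t: unaligned, tuples-only output, so only the value is printed
    printf '%s=%s\n' "$setting" "$("${PSQL:-gitlab-psql}" -At -c "SHOW $setting;")"
  done
}
```

Calling `show_checkpoint_settings` on the primary prints one `name=value` line per setting.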

A search for checkpoint_timeout in the Omnibus GitLab codebase:

If I understand correctly, without Patroni, we set checkpoint timeout to 5 minutes, and with Patroni, 30s (seems aggressive to use the minimum, no?).

Postgres checkpoints and how to tune them

I can't personally justify reducing Omnibus GitLab's PG checkpoint timeout just to avoid this Ansible timeout. If I understand correctly, doing so could cause problems for some existing GitLab environments. Also, I can imagine that adding a secondary Geo site to a large GitLab instance could take more than 5m30s to replicate PG data.

Possible solutions

@rmarshall @nwestbury @ibaum Does it sound OK to increase the Ansible timeout for that command from 5 minutes to, say, 10 minutes?

(I opened gitlab-environment-toolkit!1268 (merged) proposing to bump it to 15m.)

@mkozono I agree with increasing the timeout in ansible, and I would also advocate for making it configurable.
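The real change belongs in the Ansible task, but the idea of a configurable limit can be sketched in plain shell. This assumes coreutils `timeout` is installed; `GEO_REPLICATION_TIMEOUT` and `run_with_timeout` are hypothetical names, and the 900-second default just mirrors the 15m proposed above.

```shell
# Run a command under a time limit taken from the environment.
# GEO_REPLICATION_TIMEOUT is a hypothetical variable name (seconds);
# it falls back to 900 seconds (15 minutes) when unset.
run_with_timeout() {
  # coreutils `timeout` exits with status 124 when the limit is hit
  timeout "${GEO_REPLICATION_TIMEOUT:-900}" "$@"
}
```

For example, `GEO_REPLICATION_TIMEOUT=1800 run_with_timeout <replication command>` would allow 30 minutes, and an exit status of 124 would indicate the limit was reached rather than a failure of the command itself.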

Maybe in parallel we should consider setting `--checkpoint=fast` in the replication command? Or perhaps make it an option. It has a higher I/O cost on the primary, so it may be less desirable when adding a secondary to a live system. But for a new/empty system, it could save some time.
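For reference, this is roughly what the flag looks like on a direct `pg_basebackup` call. A sketch only: the function name is hypothetical, the host and user are taken from the log above, the data directory is inferred from it, and `-X stream -P` are illustrative choices. Whether `gitlab-ctl replicate-geo-database` can pass the flag through is not covered here. The function prints the command instead of running it.

```shell
# Build (and print, not run) a pg_basebackup command line with a selectable
# checkpoint mode. Host and user come from the job log; the data directory is
# an assumption inferred from the "Moving old data directory" line.
basebackup_cmd() {
  mode="${1:-spread}"   # pg_basebackup accepts "fast" or "spread" (its default)
  case "$mode" in
    fast|spread) ;;
    *) echo "unknown checkpoint mode: $mode" >&2; return 1 ;;
  esac
  echo pg_basebackup -h 10.168.0.122 -U gitlab_replicator \
    -D /var/opt/gitlab/postgresql/data -X stream -P "--checkpoint=$mode"
}

# Print the variant that requests an immediate checkpoint
basebackup_cmd fast
```

With `fast`, the primary performs the checkpoint as quickly as it can instead of spreading the I/O out, which is the trade-off described above.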

https://www.postgresql.org/docs/14/app-pgbasebackup.html

https://www.postgresql.org/docs/14/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP

Edited Feb 29, 2024 by Michael Kozono