pg_rewind/pg_basebackup is waiting on a checkpoint from the master
In a recent e2e test on EKS, after the master was killed, another pod was elected master, but the replacement pod (attached to the PV of the old master) was not brought up. The container log showed it waiting on pg_rewind, which in turn was waiting on a checkpoint from the new master. The probe was reporting the container as unhealthy.
With a manual CHECKPOINT command on the master, the new replica (the old master) was able to proceed successfully. So we need to answer the following questions:
- Does pg_rewind need to forcibly wait for a checkpoint on the master?
- Is there any way for pg_rewind to force the checkpoint by itself?
- Is there any better alternative than forcing a very low checkpoint_timeout on our end?
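One possible direction for the questions above, sketched here rather than proposed as the fix: the pg_rewind documentation notes that when the source is an online cluster that was recently promoted, a CHECKPOINT should be executed on it after promotion so that its control file reflects the new timeline. The manual workaround could therefore be automated by issuing CHECKPOINT on the new master right before pg_rewind runs. The helper names, host name, data directory, and connection string below are hypothetical; only the psql and pg_rewind options shown are real, and the CHECKPOINT call assumes a role that is allowed to run it (typically a superuser).

```python
import subprocess
from typing import List


def checkpoint_command(host: str, port: int = 5432, user: str = "postgres") -> List[str]:
    """Build a psql invocation that forces a checkpoint on the new master."""
    return [
        "psql",
        "-h", host,
        "-p", str(port),
        "-U", user,
        "-d", "postgres",
        "-c", "CHECKPOINT;",
    ]


def rewind_command(target_pgdata: str, source_conninfo: str) -> List[str]:
    """Build a pg_rewind invocation for the old master's data directory."""
    return [
        "pg_rewind",
        f"--target-pgdata={target_pgdata}",
        f"--source-server={source_conninfo}",
    ]


def checkpoint_then_rewind(host: str, target_pgdata: str, source_conninfo: str) -> None:
    """Force a checkpoint on the new master, then rewind the old master.

    Raises subprocess.CalledProcessError if either step fails.
    """
    subprocess.run(checkpoint_command(host), check=True)
    subprocess.run(rewind_command(target_pgdata, source_conninfo), check=True)
```

Keeping the command construction separate from execution makes it possible to log or unit-test the exact commands without a running cluster.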
The same effect may happen when initializing a replica with pg_basebackup.
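For the pg_basebackup case, a minimal sketch assuming the delay comes from the default spread checkpoint: pg_basebackup accepts --checkpoint=fast, which asks the server for an immediate checkpoint instead of waiting for a spread one. The function name, host, and data directory are hypothetical, and whether this fully removes the wait in this setup is an assumption that would need to be verified.

```python
from typing import List


def basebackup_command(
    pgdata: str,
    host: str,
    port: int = 5432,
    user: str = "postgres",
) -> List[str]:
    """Build a pg_basebackup invocation that requests a fast checkpoint."""
    return [
        "pg_basebackup",
        "-D", pgdata,
        "-h", host,
        "-p", str(port),
        "-U", user,
        "--checkpoint=fast",     # request an immediate checkpoint (default is spread)
        "--wal-method=stream",   # stream WAL while the backup is taken
    ]
```

A fast checkpoint causes a burst of I/O on the master, so this trades a shorter replica start-up for some extra load at backup time.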
Edited by Alvaro Hernandez