Backups hang after major version upgrade
Summary
After the major version upgrade, backups stopped working.
Current Behaviour
Checking the Patroni processes, we identified many wal-g backup-push commands
running in parallel:
bash-4.4$ ps -ef --forest
UID PID PPID C STIME TTY TIME CMD
postgres 30383 0 0 15:55 pts/1 00:00:00 bash
postgres 32123 30383 0 15:56 pts/1 00:00:00 \_ ps -ef --forest
postgres 17856 0 0 12:18 ? 00:00:00 sh -e
postgres 17863 17856 0 12:18 ? 00:00:45 \_ wal-g backup-push /var/lib/postgresql/data -f
postgres 9108 0 0 08:17 ? 00:00:00 sh -e
postgres 9115 9108 0 08:17 ? 00:00:45 \_ wal-g backup-push /var/lib/postgresql/data -f
postgres 8568 0 0 04:15 ? 00:00:00 sh -e
postgres 8574 8568 0 04:15 ? 00:00:48 \_ wal-g backup-push /var/lib/postgresql/data -f -p
postgres 975 0 0 04:11 pts/0 00:00:01 /bin/sh -c while true; do sleep 10; done
postgres 32120 975 0 15:56 pts/0 00:00:00 \_ /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 10
postgres 511 0 0 04:11 ? 00:00:00 /bin/bash /entrypoint
postgres 521 511 0 04:11 ? 00:00:08 \_ pgbouncer /etc/pgbouncer/pgbouncer.ini
postgres 484 0 0 04:11 ? 00:01:25 /app/stackgres-cluster-controller -Dquarkus.http.host=0.0.0.0 -Dquarkus.http.port=8080 -Dquarkus.http.ssl-port=8443 -Djava.util.logging.manager=org.jboss.logmanager.LogManager
postgres 467 0 0 04:11 ? 00:00:03 /bin/sh -exc run_postgres_exporter() { set -x exec /usr/local/bin/postgres_exporter \ --log.level=info } set +x while true do if ( [ -z "$PID" ] || [ ! -d "/proc/$PID" ]
postgres 968 467 0 04:11 ? 00:01:04 \_ /usr/local/bin/postgres_exporter --log.level=info
postgres 32119 467 0 15:56 ? 00:00:00 \_ /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 5
postgres 445 0 0 04:11 ? 00:03:56 /usr/local/bin/envoy -c /etc/envoy/default_envoy.json
postgres 434 0 0 04:11 ? 00:00:53 /usr/bin/python3.6 /usr/bin/patroni /etc/patroni/postgres.yml
postgres 1 0 0 04:08 ? 00:00:00 /pause
postgres 524 1 0 04:11 ? 00:00:10 /usr/lib/postgresql/14.0/bin/postgres -D /var/lib/postgresql/data --config-file=/var/lib/postgresql/data/postgresql.conf --listen_addresses=127.0.0.1 --port=5432 --cluster_name=pg12
postgres 529 524 0 04:11 ? 00:00:00 \_ postgres: pg12: checkpointer
postgres 530 524 0 04:11 ? 00:00:00 \_ postgres: pg12: background writer
postgres 531 524 0 04:11 ? 00:00:04 \_ postgres: pg12: stats collector
postgres 533 524 0 04:11 ? 00:00:06 \_ postgres: pg12: postgres postgres [local] idle
postgres 538 524 0 04:11 ? 00:00:02 \_ postgres: pg12: walwriter
postgres 539 524 0 04:11 ? 00:00:00 \_ postgres: pg12: autovacuum launcher
postgres 540 524 0 04:11 ? 00:00:00 \_ postgres: pg12: archiver failed on 00000002.history
postgres 541 524 0 04:11 ? 00:00:00 \_ postgres: pg12: logical replication launcher
postgres 1410 524 0 04:11 ? 00:00:44 \_ postgres: pg12: postgres postgres [local] idle
postgres 5132 524 0 04:14 ? 00:00:00 \_ postgres: pg12: walsender replicator 127.0.0.1(37806) streaming 3/950000F8
postgres 8647 524 0 04:15 ? 00:00:01 \_ postgres: pg12: postgres postgres [local] SELECT
postgres 9189 524 0 08:17 ? 00:00:00 \_ postgres: pg12: postgres postgres [local] SELECT
postgres 17937 524 0 12:18 ? 00:00:00 \_ postgres: pg12: postgres postgres [local] SELECT
We also noted that the archiver is failing (its process title above shows archiver failed on 00000002.history).
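This can be cross-checked from psql via the pg_stat_archiver view (a diagnostic sketch; failed_count and last_failed_wal report the archiver failure state):
postgres=# select archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time from pg_stat_archiver;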
Looking at the backup pod, it is waiting for pg_stop_backup() to finish:
❯ kubectl exec -it pg12-backup-1634214600-xqjlx -- bash
bash-4.4$ cat /tmp/backup-push
INFO: 2021/10/14 16:06:39.937875 Selecting the latest backup as the base for the current delta backup...
INFO: 2021/10/14 16:06:40.500230 Doing full backup.
INFO: 2021/10/14 16:06:40.508471 Calling pg_start_backup()
INFO: 2021/10/14 16:06:40.600101 Starting a new tar bundle
INFO: 2021/10/14 16:06:40.600138 Walking ...
INFO: 2021/10/14 16:06:40.600335 Starting part 1 ...
INFO: 2021/10/14 16:06:44.267726 Finished writing part 1.
INFO: 2021/10/14 16:06:44.267765 Starting part 2 ...
INFO: 2021/10/14 16:06:47.856170 Finished writing part 2.
INFO: 2021/10/14 16:06:47.856203 Starting part 3 ...
INFO: 2021/10/14 16:06:51.085615 Finished writing part 3.
INFO: 2021/10/14 16:06:51.085661 Starting part 4 ...
INFO: 2021/10/14 16:06:54.292075 Finished writing part 4.
INFO: 2021/10/14 16:06:54.292145 Starting part 5 ...
INFO: 2021/10/14 16:06:57.552249 Finished writing part 5.
INFO: 2021/10/14 16:06:57.552292 Starting part 6 ...
INFO: 2021/10/14 16:07:00.656844 Finished writing part 6.
INFO: 2021/10/14 16:07:00.656892 Starting part 7 ...
INFO: 2021/10/14 16:07:06.989597 Finished writing part 7.
INFO: 2021/10/14 16:07:06.989634 Starting part 8 ...
INFO: 2021/10/14 16:07:07.602336 Packing ...
INFO: 2021/10/14 16:07:07.602658 Finished writing part 8.
INFO: 2021/10/14 16:07:08.070507 Starting part 9 ...
INFO: 2021/10/14 16:07:08.070551 /global/pg_control
INFO: 2021/10/14 16:07:08.074790 Finished writing part 9.
INFO: 2021/10/14 16:07:08.075768 Calling pg_stop_backup()
When checking PostgreSQL, each stuck backend is waiting on the BackupWaitWalArchive
IPC event:
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_activity where query ~* 'pg_stop_backup' and pid != pg_backend_pid();
-[ RECORD 1 ]----+-------------------------------------------------------------
datid | 16405
datname | postgres
pid | 15811
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr |
client_hostname |
client_port | -1
backend_start | 2021-10-14 16:06:40.501405+00
xact_start | 2021-10-14 16:07:08.075822+00
query_start | 2021-10-14 16:07:08.07624+00
state_change | 2021-10-14 16:07:08.076241+00
wait_event_type | IPC
wait_event | BackupWaitWalArchive
state | active
backend_xid |
backend_xmin | 16396
query_id |
query | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)
backend_type | client backend
-[ RECORD 2 ]----+-------------------------------------------------------------
datid | 16405
datname | postgres
pid | 8647
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr |
client_hostname |
client_port | -1
backend_start | 2021-10-14 04:15:55.43624+00
xact_start | 2021-10-14 04:16:23.642652+00
query_start | 2021-10-14 04:16:23.643071+00
state_change | 2021-10-14 04:16:23.643071+00
wait_event_type | IPC
wait_event | BackupWaitWalArchive
state | active
backend_xid |
backend_xmin | 14975
query_id |
query | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)
backend_type | client backend
-[ RECORD 3 ]----+-------------------------------------------------------------
datid | 16405
datname | postgres
pid | 17937
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr |
client_hostname |
client_port | -1
backend_start | 2021-10-14 12:18:45.653528+00
xact_start | 2021-10-14 12:19:13.022957+00
query_start | 2021-10-14 12:19:13.023389+00
state_change | 2021-10-14 12:19:13.02339+00
wait_event_type | IPC
wait_event | BackupWaitWalArchive
state | active
backend_xid |
backend_xmin | 15940
query_id |
query | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)
backend_type | client backend
-[ RECORD 4 ]----+-------------------------------------------------------------
datid | 16405
datname | postgres
pid | 9189
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr |
client_hostname |
client_port | -1
backend_start | 2021-10-14 08:17:33.167643+00
xact_start | 2021-10-14 08:17:59.863883+00
query_start | 2021-10-14 08:17:59.864545+00
state_change | 2021-10-14 08:17:59.864545+00
wait_event_type | IPC
wait_event | BackupWaitWalArchive
state | active
backend_xid |
backend_xmin | 15458
query_id |
query | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)
backend_type | client backend
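pg_stop_backup(false) only returns once all the WAL needed to make the backup consistent has been archived, so a persistently failing archive_command blocks each of these backends indefinitely. The WAL files still waiting to be archived show up as .ready entries under archive_status (a diagnostic sketch, run from inside the patroni container):
bash-4.4$ ls /var/lib/postgresql/data/pg_wal/archive_status | grep '\.ready$' | head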
I will try to get the logs to figure out the root cause.
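As a starting point, a sketch of the log collection (the pod name is illustrative and the container name patroni is an assumption based on the StackGres pod layout):
❯ kubectl logs pg12-0 -c patroni --timestamps | grep -iE 'archiv|wal-g' | tail -n 50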
Steps to reproduce
- Created a cluster with PostgreSQL 12 called pg12
- Ran a benchmark with dbops to create a database of ~8 GB
- Checked that monitoring worked properly
- Manually created a backup of pg12 (see the manifest sketch after this list)
- Upgraded the cluster pg12 to PostgreSQL 13 using dbops
- Restored the original backup, called pg12-before-upgrade
- Upgraded pg12 from 13.0 to 13.4 using a dbops minor version upgrade
- Restarted pg12 to apply the new minor version
- Tested the connection using psql
- Upgraded pg12 to 14.0
- Ran another backup from PostgreSQL 14; it never finished
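For reference, the manual backups were created with an SGBackup resource along these lines (a sketch assuming the StackGres 1.0 SGBackup schema; the name matches the backup referenced above):
❯ kubectl apply -f - <<EOF
apiVersion: stackgres.io/v1
kind: SGBackup
metadata:
  name: pg12-before-upgrade
spec:
  sgCluster: pg12
  managedLifecycle: false
EOF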
Expected Behaviour
The backup completes successfully after the major version upgrade.
Possible Solution
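Not confirmed yet, but since every stuck backend is waiting inside pg_stop_backup() for WAL archiving, restoring WAL archiving (starting with whatever makes the archiving of 00000002.history fail) should unblock the backups. The leftover backends could then be cleaned up with something like the following sketch (terminate with care on a production primary):
postgres=# select pg_terminate_backend(pid) from pg_stat_activity where query ~* 'pg_stop_backup' and pid != pg_backend_pid();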
Environment
- StackGres version:
❯ kubectl get deployments -n stackgres stackgres-operator --template '{{ printf "%s\n" (index .spec.template.spec.containers 0).image }}'
stackgres/operator:1.0.0
- Kubernetes version:
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:56:19Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-eks-d88609", GitCommit:"d886092805d5cc3a47ed5cf0c43de38ce442dfcb", GitTreeState:"clean", BuildDate:"2021-07-31T00:29:12Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.22) and server (1.20) exceeds the supported minor version skew of +/-1
- Cloud provider or hardware configuration: AWS
Relevant logs and/or screenshots