Backups hang after major version upgrade
Summary
After the major version upgrade, backups stopped working.
Current Behaviour
Checking the Patroni processes, we identified many wal-g backup-push commands
running in parallel:
bash-4.4$ ps -ef --forest
UID PID PPID C STIME TTY TIME CMD
postgres 30383 0 0 15:55 pts/1 00:00:00 bash
postgres 32123 30383 0 15:56 pts/1 00:00:00 \_ ps -ef --forest
postgres 17856 0 0 12:18 ? 00:00:00 sh -e
postgres 17863 17856 0 12:18 ? 00:00:45 \_ wal-g backup-push /var/lib/postgresql/data -f
postgres 9108 0 0 08:17 ? 00:00:00 sh -e
postgres 9115 9108 0 08:17 ? 00:00:45 \_ wal-g backup-push /var/lib/postgresql/data -f
postgres 8568 0 0 04:15 ? 00:00:00 sh -e
postgres 8574 8568 0 04:15 ? 00:00:48 \_ wal-g backup-push /var/lib/postgresql/data -f -p
postgres 975 0 0 04:11 pts/0 00:00:01 /bin/sh -c while true; do sleep 10; done
postgres 32120 975 0 15:56 pts/0 00:00:00 \_ /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 10
postgres 511 0 0 04:11 ? 00:00:00 /bin/bash /entrypoint
postgres 521 511 0 04:11 ? 00:00:08 \_ pgbouncer /etc/pgbouncer/pgbouncer.ini
postgres 484 0 0 04:11 ? 00:01:25 /app/stackgres-cluster-controller -Dquarkus.http.host=0.0.0.0 -Dquarkus.http.port=8080 -Dquarkus.http.ssl-port=8443 -Djava.util.logging.manager=org.jboss.logmanager.LogManager
postgres 467 0 0 04:11 ? 00:00:03 /bin/sh -exc run_postgres_exporter() { set -x exec /usr/local/bin/postgres_exporter \ --log.level=info } set +x while true do if ( [ -z "$PID" ] || [ ! -d "/proc/$PID" ]
postgres 968 467 0 04:11 ? 00:01:04 \_ /usr/local/bin/postgres_exporter --log.level=info
postgres 32119 467 0 15:56 ? 00:00:00 \_ /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 5
postgres 445 0 0 04:11 ? 00:03:56 /usr/local/bin/envoy -c /etc/envoy/default_envoy.json
postgres 434 0 0 04:11 ? 00:00:53 /usr/bin/python3.6 /usr/bin/patroni /etc/patroni/postgres.yml
postgres 1 0 0 04:08 ? 00:00:00 /pause
postgres 524 1 0 04:11 ? 00:00:10 /usr/lib/postgresql/14.0/bin/postgres -D /var/lib/postgresql/data --config-file=/var/lib/postgresql/data/postgresql.conf --listen_addresses=127.0.0.1 --port=5432 --cluster_name=pg12
postgres 529 524 0 04:11 ? 00:00:00 \_ postgres: pg12: checkpointer
postgres 530 524 0 04:11 ? 00:00:00 \_ postgres: pg12: background writer
postgres 531 524 0 04:11 ? 00:00:04 \_ postgres: pg12: stats collector
postgres 533 524 0 04:11 ? 00:00:06 \_ postgres: pg12: postgres postgres [local] idle
postgres 538 524 0 04:11 ? 00:00:02 \_ postgres: pg12: walwriter
postgres 539 524 0 04:11 ? 00:00:00 \_ postgres: pg12: autovacuum launcher
postgres 540 524 0 04:11 ? 00:00:00 \_ postgres: pg12: archiver failed on 00000002.history
postgres 541 524 0 04:11 ? 00:00:00 \_ postgres: pg12: logical replication launcher
postgres 1410 524 0 04:11 ? 00:00:44 \_ postgres: pg12: postgres postgres [local] idle
postgres 5132 524 0 04:14 ? 00:00:00 \_ postgres: pg12: walsender replicator 127.0.0.1(37806) streaming 3/950000F8
postgres 8647 524 0 04:15 ? 00:00:01 \_ postgres: pg12: postgres postgres [local] SELECT
postgres 9189 524 0 08:17 ? 00:00:00 \_ postgres: pg12: postgres postgres [local] SELECT
postgres 17937 524 0 12:18 ? 00:00:00 \_ postgres: pg12: postgres postgres [local] SELECT
We also noted that the archiver is failing (its process title above shows archiver failed on 00000002.history).
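This can be cross-checked from psql via the pg_stat_archiver view (a diagnostic sketch; failed_count and last_failed_wal report the archiver failure state):
postgres=# select archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time from pg_stat_archiver;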
Looking at the backup pod, it is waiting for pg_stop_backup() to finish:
❯ kubectl exec -it pg12-backup-1634214600-xqjlx -- bash
bash-4.4$ cat /tmp/backup-push
INFO: 2021/10/14 16:06:39.937875 Selecting the latest backup as the base for the current delta backup...
INFO: 2021/10/14 16:06:40.500230 Doing full backup.
INFO: 2021/10/14 16:06:40.508471 Calling pg_start_backup()
INFO: 2021/10/14 16:06:40.600101 Starting a new tar bundle
INFO: 2021/10/14 16:06:40.600138 Walking ...
INFO: 2021/10/14 16:06:40.600335 Starting part 1 ...
INFO: 2021/10/14 16:06:44.267726 Finished writing part 1.
INFO: 2021/10/14 16:06:44.267765 Starting part 2 ...
INFO: 2021/10/14 16:06:47.856170 Finished writing part 2.
INFO: 2021/10/14 16:06:47.856203 Starting part 3 ...
INFO: 2021/10/14 16:06:51.085615 Finished writing part 3.
INFO: 2021/10/14 16:06:51.085661 Starting part 4 ...
INFO: 2021/10/14 16:06:54.292075 Finished writing part 4.
INFO: 2021/10/14 16:06:54.292145 Starting part 5 ...
INFO: 2021/10/14 16:06:57.552249 Finished writing part 5.
INFO: 2021/10/14 16:06:57.552292 Starting part 6 ...
INFO: 2021/10/14 16:07:00.656844 Finished writing part 6.
INFO: 2021/10/14 16:07:00.656892 Starting part 7 ...
INFO: 2021/10/14 16:07:06.989597 Finished writing part 7.
INFO: 2021/10/14 16:07:06.989634 Starting part 8 ...
INFO: 2021/10/14 16:07:07.602336 Packing ...
INFO: 2021/10/14 16:07:07.602658 Finished writing part 8.
INFO: 2021/10/14 16:07:08.070507 Starting part 9 ...
INFO: 2021/10/14 16:07:08.070551 /global/pg_control
INFO: 2021/10/14 16:07:08.074790 Finished writing part 9.
INFO: 2021/10/14 16:07:08.075768 Calling pg_stop_backup()
When checking PostgreSQL, each stuck backend is waiting on the BackupWaitWalArchive
IPC event:
postgres=# \x
Expanded display is on.
postgres=# select * from pg_stat_activity where query ~* 'pg_stop_backup' and pid != pg_backend_pid();
-[ RECORD 1 ]----+-------------------------------------------------------------
datid | 16405
datname | postgres
pid | 15811
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr |
client_hostname |
client_port | -1
backend_start | 2021-10-14 16:06:40.501405+00
xact_start | 2021-10-14 16:07:08.075822+00
query_start | 2021-10-14 16:07:08.07624+00
state_change | 2021-10-14 16:07:08.076241+00
wait_event_type | IPC
wait_event | BackupWaitWalArchive
state | active
backend_xid |
backend_xmin | 16396
query_id |
query | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)
backend_type | client backend
-[ RECORD 2 ]----+-------------------------------------------------------------
datid | 16405
datname | postgres
pid | 8647
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr |
client_hostname |
client_port | -1
backend_start | 2021-10-14 04:15:55.43624+00
xact_start | 2021-10-14 04:16:23.642652+00
query_start | 2021-10-14 04:16:23.643071+00
state_change | 2021-10-14 04:16:23.643071+00
wait_event_type | IPC
wait_event | BackupWaitWalArchive
state | active
backend_xid |
backend_xmin | 14975
query_id |
query | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)
backend_type | client backend
-[ RECORD 3 ]----+-------------------------------------------------------------
datid | 16405
datname | postgres
pid | 17937
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr |
client_hostname |
client_port | -1
backend_start | 2021-10-14 12:18:45.653528+00
xact_start | 2021-10-14 12:19:13.022957+00
query_start | 2021-10-14 12:19:13.023389+00
state_change | 2021-10-14 12:19:13.02339+00
wait_event_type | IPC
wait_event | BackupWaitWalArchive
state | active
backend_xid |
backend_xmin | 15940
query_id |
query | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)
backend_type | client backend
-[ RECORD 4 ]----+-------------------------------------------------------------
datid | 16405
datname | postgres
pid | 9189
leader_pid |
usesysid | 10
usename | postgres
application_name |
client_addr |
client_hostname |
client_port | -1
backend_start | 2021-10-14 08:17:33.167643+00
xact_start | 2021-10-14 08:17:59.863883+00
query_start | 2021-10-14 08:17:59.864545+00
state_change | 2021-10-14 08:17:59.864545+00
wait_event_type | IPC
wait_event | BackupWaitWalArchive
state | active
backend_xid |
backend_xmin | 15458
query_id |
query | SELECT labelfile, spcmapfile, lsn FROM pg_stop_backup(false)
backend_type | client backend
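pg_stop_backup(false) only returns once all the WAL needed to make the backup consistent has been archived, so a persistently failing archive_command blocks each of these backends indefinitely. The WAL files still waiting to be archived show up as .ready entries under archive_status (a diagnostic sketch, run from inside the patroni container):
bash-4.4$ ls /var/lib/postgresql/data/pg_wal/archive_status | grep '\.ready$' | head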
I will try to get the logs to figure out the root cause.
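As a starting point, a sketch of the log collection (the pod name is illustrative and the container name patroni is an assumption based on the StackGres pod layout):
❯ kubectl logs pg12-0 -c patroni --timestamps | grep -iE 'archiv|wal-g' | tail -n 50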
Steps to reproduce
- Created a cluster with PostgreSQL 12 called pg12
- Ran a benchmark with dbops to create a database of ~8 GB
- Checked that monitoring worked properly
- Manually created a backup of pg12 (see the manifest sketch after this list)
- Upgraded the cluster pg12 to PostgreSQL 13 using dbops
- Restored the original backup, called pg12-before-upgrade
- Upgraded pg12 from 13.0 to 13.4 using a dbops minor version upgrade
- Restarted pg12 to apply the new minor version
- Tested the connection using psql
- Upgraded pg12 to 14.0
- Ran another backup from PostgreSQL 14; it never finished
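For reference, the manual backups were created with an SGBackup resource along these lines (a sketch assuming the StackGres 1.0 SGBackup schema; the name matches the backup referenced above):
❯ kubectl apply -f - <<EOF
apiVersion: stackgres.io/v1
kind: SGBackup
metadata:
  name: pg12-before-upgrade
spec:
  sgCluster: pg12
  managedLifecycle: false
EOF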
Expected Behaviour
The backup completes successfully after the major version upgrade.
Possible Solution
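Not confirmed yet, but since every stuck backend is waiting inside pg_stop_backup() for WAL archiving, restoring WAL archiving (starting with whatever makes the archiving of 00000002.history fail) should unblock the backups. The leftover backends could then be cleaned up with something like the following sketch (terminate with care on a production primary):
postgres=# select pg_terminate_backend(pid) from pg_stat_activity where query ~* 'pg_stop_backup' and pid != pg_backend_pid();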
Environment
- StackGres version:
❯ kubectl get deployments -n stackgres stackgres-operator --template '{{ printf "%s\n" (index .spec.template.spec.containers 0).image }}'
stackgres/operator:1.0.0
- Kubernetes version:
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:56:19Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.7-eks-d88609", GitCommit:"d886092805d5cc3a47ed5cf0c43de38ce442dfcb", GitTreeState:"clean", BuildDate:"2021-07-31T00:29:12Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.22) and server (1.20) exceeds the supported minor version skew of +/-1
- Cloud provider or hardware configuration: AWS
Relevant logs and/or screenshots