Node not recovered after failover
After a failover, StackGres (SG) does not recover one of the nodes:
anthony@anthony-HP-EliteBook-840-G4:~/Trabajo/OnGres/ongres_repo/cope/deploy/pgbouncer_upgrade$ kubectl exec -it -n sg-dev mminventory-pro-dev-1 -c patroni -- bash
bash-4.4$ patronictl list
+ Cluster: mminventory-pro-dev (6863135752587546694) --+--------------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------------------+---------------------+--------+--------------+----+-----------+
| mminventory-pro-dev-0 | 10.192.147.101:7433 | | running | 7 | 0 |
| mminventory-pro-dev-1 | 10.192.146.100:7433 | Leader | running | 7 | |
| mminventory-pro-dev-2 | 10.192.146.226:7433 | | start failed | | unknown |
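To spot members in this state without reading the table by eye, the output can be parsed. The following is only a sketch: it assumes the pipe-delimited table layout shown above, and `parse_patronictl_list` is a hypothetical helper, not part of Patroni or StackGres.

```python
def parse_patronictl_list(text):
    """Parse the ASCII table printed by `patronictl list` into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and the "+ Cluster: ..." title line
        # Drop the empty strings produced by the leading and trailing pipes.
        cells = [cell.strip() for cell in line.split("|")[1:-1]]
        if header is None:
            header = cells  # first pipe-delimited row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Sample copied from the output above.
sample = """
+ Cluster: mminventory-pro-dev (6863135752587546694) --+--------------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------------------+---------------------+--------+--------------+----+-----------+
| mminventory-pro-dev-0 | 10.192.147.101:7433 | | running | 7 | 0 |
| mminventory-pro-dev-1 | 10.192.146.100:7433 | Leader | running | 7 | |
| mminventory-pro-dev-2 | 10.192.146.226:7433 | | start failed | | unknown |
"""

members = parse_patronictl_list(sample)
not_running = [m["Member"] for m in members if m["State"] != "running"]
print(not_running)
```

With the sample above, only `mminventory-pro-dev-2` is reported as not running.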
PostgreSQL logs from pod mminventory-pro-dev-2:
bash-4.4$ tail -f postgres-25.csv
2020-09-11 14:25:47.850 UTC,,,1795,,5f5b88eb.703,4,,2020-09-11 14:25:47 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""
2020-09-11 14:25:59.285 UTC,,,1846,,5f5b88f7.736,1,,2020-09-11 14:25:59 UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2020-09-11 14:25:59.289 UTC,,,1849,,5f5b88f7.739,1,,2020-09-11 14:25:59 UTC,,0,LOG,00000,"database system was shut down in recovery at 2020-09-11 06:15:09 UTC",,,,,,,,,""
2020-09-11 14:25:59.290 UTC,,,1849,,5f5b88f7.739,2,,2020-09-11 14:25:59 UTC,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2020-09-11 14:25:59.290 UTC,,,1850,"[local]",5f5b88f7.73a,1,"",2020-09-11 14:25:59 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-09-11 14:25:59.290 UTC,"postgres","postgres",1850,"[local]",5f5b88f7.73a,2,"",2020-09-11 14:25:59 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-09-11 14:25:59.290 UTC,,,1849,,5f5b88f7.739,3,,2020-09-11 14:25:59 UTC,,0,FATAL,XX000,"requested timeline 7 is not a child of this server's history","Latest checkpoint is at 16/F6000028 on timeline 6, but in the history of the requested timeline, the server forked off from that timeline at 16/F50016B8.",,,,,,,,""
2020-09-11 14:25:59.291 UTC,,,1846,,5f5b88f7.736,2,,2020-09-11 14:25:59 UTC,,0,LOG,00000,"startup process (PID 1849) exited with exit code 1",,,,,,,,,""
2020-09-11 14:25:59.291 UTC,,,1846,,5f5b88f7.736,3,,2020-09-11 14:25:59 UTC,,0,LOG,00000,"aborting startup due to startup process failure",,,,,,,,,""
2020-09-11 14:25:59.342 UTC,,,1846,,5f5b88f7.736,4,,2020-09-11 14:25:59 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""
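The FATAL message above says that this replica's latest checkpoint (16/F6000028, on timeline 6) lies past the point where timeline 7 forked off (16/F50016B8), so the replica holds WAL that is not part of the new leader's history. A minimal sketch of that comparison, assuming the usual PostgreSQL LSN notation of two hexadecimal 32-bit halves separated by a slash; the helper name `lsn_to_int` is hypothetical:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/F6000028' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


# Values copied from the FATAL log line of mminventory-pro-dev-2.
checkpoint_tl6 = lsn_to_int("16/F6000028")  # replica's latest checkpoint, timeline 6
fork_point_tl7 = lsn_to_int("16/F50016B8")  # where timeline 7 branched off timeline 6

# The replica can only follow timeline 7 if it has not gone past the fork point.
diverged = checkpoint_tl6 > fork_point_tl7
bytes_past_fork = checkpoint_tl6 - fork_point_tl7

print(diverged)
print(bytes_past_fork)
```

Here the checkpoint is roughly 16 MB past the fork point, which is consistent with the node refusing to start as a standby of the new leader.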
Timelines
# master
postgres=# select substring(pg_walfile_name(pg_current_wal_lsn()), 1, 8);
substring
-----------
00000007
(1 row)
# replica
bash-4.4$ psql "dbname=postgres replication=database" -c "IDENTIFY_SYSTEM;"
systemid | timeline | xlogpos | dbname
---------------------+----------+-------------+----------
6863135752587546694 | 7 | 17/26444380 | postgres
(1 row)
Solution: reinitialize the failed node with patronictl:
patronictl reinit mminventory-pro-dev mminventory-pro-dev-2
Versions: Kubernetes 1.16 and StackGres 0.9.
Edited by Matteo Melli