Backup system and replication are not working
The secondary database is not reporting as ready for replication:
2017-02-11_16:55:23.68447 db2 postgresql:
2017-02-11_16:55:24.88501 db2 postgresql: 2017-02-11 16:55:24 GMT [14157]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:26.98179 db2 postgresql: 2017-02-11 16:55:26 GMT [14158]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:27.74941 db2 postgresql: 2017-02-11 16:55:27 GMT [14159]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:28.69045 db2 postgresql: 2017-02-11 16:55:28 GMT [14160]: [1-1] LOG: started streaming WAL from primary at A74/A1000000 on timeline 2
2017-02-11_16:55:28.69066 db2 postgresql: 2017-02-11 16:55:28 GMT [14160]: [2-1] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000200000A74000000A1 has already been removed
2017-02-11_16:55:28.69091 db2 postgresql:
2017-02-11_16:55:29.95074 db2 postgresql: 2017-02-11 16:55:29 GMT [14162]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:29.95187 db2 postgresql: 2017-02-11 16:55:29 GMT [14163]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:29.95294 db2 postgresql: 2017-02-11 16:55:29 GMT [14164]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:29.95390 db2 postgresql: 2017-02-11 16:55:29 GMT [14165]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:32.94563 db2 postgresql: 2017-02-11 16:55:32 GMT [14166]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:33.02936 db2 postgresql: 2017-02-11 16:55:33 GMT [14168]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:33.03033 db2 postgresql: 2017-02-11 16:55:33 GMT [14169]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:33.03135 db2 postgresql: 2017-02-11 16:55:33 GMT [14170]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:33.03227 db2 postgresql: 2017-02-11 16:55:33 GMT [14171]: [1-1] FATAL: the database system is starting up
2017-02-11_16:55:33.69435 db2 postgresql: 2017-02-11 16:55:33 GMT [14172]: [1-1] LOG: started streaming WAL from primary at A74/A1000000 on timeline 2
2017-02-11_16:55:33.69459 db2 postgresql: 2017-02-11 16:55:33 GMT [14172]: [2-1] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000200000A74000000A1 has already been removed
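For anyone reading the log above: the segment named in the FATAL line is the one that contains the position the secondary tries to stream from (`A74/A1000000` on timeline 2), so the secondary cannot start until that segment is available again or the secondary is rebuilt. A minimal sketch of the mapping, assuming the default 16 MB WAL segment size (the helper name `wal_segment_name` is made up for illustration, it is not a PostgreSQL API):

```python
def wal_segment_name(timeline: int, lsn: str,
                     segment_size: int = 16 * 1024 * 1024) -> str:
    """Return the name of the WAL segment file that contains the given LSN.

    Assumption: segment_size is the PostgreSQL default of 16 MB.
    """
    # An LSN is printed as two hex numbers: high 32 bits / low 32 bits.
    high_hex, low_hex = lsn.split("/")
    high = int(high_hex, 16)
    low = int(low_hex, 16)

    # Which segment inside the current 4 GB "log" the position falls into.
    segment = low // segment_size

    # File name is three 8-digit uppercase hex fields: timeline, log, segment.
    return f"{timeline:08X}{high:08X}{segment:08X}"


# Values taken from the "started streaming WAL" log line above.
print(wal_segment_name(2, "A74/A1000000"))
```

With the values from the log this gives `0000000200000A74000000A1`, the same segment the primary reports as already removed.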
And our backups are simply not working.
On the other hand, the way the backups are set up now relies on a host that should be following the primary database, but that host is also broken. And this solution was not agreed on by the team.
This way of solving things does not work and does not scale. We need to seriously stop and think so that we really solve things, and not just throw more hosts into a brittle system that does not seem to work at all or that breaks with a light breeze.
In addition to this, there is no documentation whatsoever and, to add insult to injury, the "backup is too old" alert points to a nonexistent documentation page.
I just had to reverse engineer Chef to understand where the backups are and how they work. This is not the way to do things.
Let's meet on Monday and discuss how this is actually gonna work, because so far this is just collapsing on itself.
cc/ @gl-infra