Unexpected writers to the WAL archive: something wrote WALs for timeline 5 while gprd is still on timeline 4
This issue describes a potential danger to the WAL-G archives: another machine wrote WALs with a new timeline, 5, while the primary is still on timeline 4.
The gprd WAL-G archive contains WALs for timeline 00000005:
The gprd primary is still on timeline 4:
patroni-v12-02-db-gprd.c.gitlab-production.internal:~$ sudo ls -la /var/opt/gitlab/postgresql/data12/pg_wal | tail -5
-rw------- 1 gitlab-psql gitlab-psql 16777216 May 18 08:07 000000040004E1E100000028
-rw------- 1 gitlab-psql gitlab-psql 16777216 May 18 08:07 000000040004E1E100000029
-rw------- 1 gitlab-psql gitlab-psql 140 May 8 14:46 00000004.history
drwx------ 2 gitlab-psql gitlab-psql 6602752 May 18 08:09 archive_status
drwx------ 3 gitlab-psql gitlab-psql 4096 May 8 11:13 .wal-g
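The timeline of a segment can be read directly from its file name: the 24 hex characters are three 8-character fields, timeline ID, log number, and segment number. A minimal bash sketch (the `wal_timeline` helper is hypothetical, not part of WAL-G):

```shell
# bash: extract the timeline ID from a WAL segment file name.
# A 24-char name is TLI + log + seg, 8 hex digits each.
wal_timeline() {
  printf '%d\n' "0x${1:0:8}"
}

wal_timeline 000000040004E1E100000028   # -> 4
wal_timeline 000000050004DB5A0000007E   # -> 5
```

So the files in `pg_wal` above are all on timeline 4, while the archive also holds timeline-5 names.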
Manually checking:
# wal-g wal-fetch 000000050004DB5A0000007E /tmp/000000050004DB5A0000007E
# ls -la /tmp/000000050004DB5A0000007E
-rw-r--r-- 1 root root 16777216 May 18 08:08 /tmp/000000050004DB5A0000007E
# wal-g wal-fetch 000000050004DB5A0000007F /tmp/000000050004DB5A0000007F
ERROR: 2021/05/18 08:12:13.536432 Archive '000000050004DB5A0000007F' does not exist.
-- it looks like 000000050004DB5A0000007E is the last WAL in the phantom timeline 5.
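The probe above works because segment names on one timeline are sequential: if fetching the successor fails, we have found the last archived segment. A hedged bash sketch of the name arithmetic (the `next_wal` helper is hypothetical and assumes the default 16 MiB segment size, where the segment field wraps at 0xFF into the next log number):

```shell
# bash: compute the next WAL segment name on the same timeline.
# Assumes default 16 MiB segments: seg runs 00..FF, then log increments.
next_wal() {
  local tli=${1:0:8} log=${1:8:8} seg=${1:16:8}
  local n=$(( 16#$seg + 1 ))
  if (( n > 16#FF )); then
    n=0
    log=$(printf '%08X' $(( 16#$log + 1 )))
  fi
  printf '%s%s%08X\n' "$tli" "$log" "$n"
}

next_wal 000000050004DB5A0000007E   # -> 000000050004DB5A0000007F
```

This is exactly the name `wal-fetch` failed on, consistent with 000000050004DB5A0000007E being the tail of timeline 5 in the archive.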
Database Lab consumed these WALs, switched to timeline 5 following the "phantom" source, replayed a few WALs, and then got stuck, not progressing anymore:
2021-05-17 13:27:33.616 UTC,,,12,,60a172c6.c,293496,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000040004DB5A00000067"" from archive",,,,,,,,,""
2021-05-17 13:27:34.203 UTC,,,12,,60a172c6.c,293497,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000040004DB5A00000068"" from archive",,,,,,,,,""
2021-05-17 13:27:34.276 UTC,,,12,,60a172c6.c,293498,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000040004DB5A00000069"" from archive",,,,,,,,,""
2021-05-17 13:27:34.367 UTC,,,12,,60a172c6.c,293499,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000040004DB5A0000006A"" from archive",,,,,,,,,""
2021-05-17 13:27:35.043 UTC,,,12,,60a172c6.c,293500,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006B"" from archive",,,,,,,,,""
2021-05-17 13:27:36.159 UTC,,,12,,60a172c6.c,293501,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006C"" from archive",,,,,,,,,""
2021-05-17 13:27:36.214 UTC,,,12,,60a172c6.c,293502,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006D"" from archive",,,,,,,,,""
2021-05-17 13:27:36.269 UTC,,,12,,60a172c6.c,293503,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006E"" from archive",,,,,,,,,""
2021-05-17 13:27:36.326 UTC,,,12,,60a172c6.c,293504,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006F"" from archive",,,,,,,,,""
2021-05-17 13:27:36.384 UTC,,,12,,60a172c6.c,293505,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000070"" from archive",,,,,,,,,""
2021-05-17 13:27:36.443 UTC,,,12,,60a172c6.c,293506,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000071"" from archive",,,,,,,,,""
2021-05-17 13:27:36.498 UTC,,,12,,60a172c6.c,293507,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000072"" from archive",,,,,,,,,""
2021-05-17 13:27:36.566 UTC,,,12,,60a172c6.c,293508,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000073"" from archive",,,,,,,,,""
2021-05-17 13:27:36.627 UTC,,,12,,60a172c6.c,293509,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000074"" from archive",,,,,,,,,""
2021-05-17 13:27:36.693 UTC,,,12,,60a172c6.c,293510,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000075"" from archive",,,,,,,,,""
2021-05-17 13:27:36.750 UTC,,,12,,60a172c6.c,293511,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000076"" from archive",,,,,,,,,""
2021-05-17 13:27:36.808 UTC,,,12,,60a172c6.c,293512,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000077"" from archive",,,,,,,,,""
2021-05-17 13:27:36.885 UTC,,,12,,60a172c6.c,293513,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000078"" from archive",,,,,,,,,""
2021-05-17 13:27:36.954 UTC,,,12,,60a172c6.c,293514,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000079"" from archive",,,,,,,,,""
2021-05-17 13:27:37.021 UTC,,,12,,60a172c6.c,293515,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007A"" from archive",,,,,,,,,""
2021-05-17 13:27:37.092 UTC,,,12,,60a172c6.c,293516,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007B"" from archive",,,,,,,,,""
2021-05-17 13:27:37.469 UTC,,,12,,60a172c6.c,293517,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007C"" from archive",,,,,,,,,""
2021-05-17 13:27:37.853 UTC,,,12,,60a172c6.c,293518,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007D"" from archive",,,,,,,,,""
2021-05-17 13:27:38.422 UTC,,,12,,60a172c6.c,293519,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007E"" from archive",,,,,,,,,""
-- @craig-gomes @abrandl I'm thinking about how to fix it; most likely we'll need to perform backup-fetch again (while also considering other options, such as reverting back in time using ZFS snapshots and staying on timeline 4).
It looks like this "phantom" switch between timelines 4 and 5 happened at these segments: 000000040004DB5A0000006A -> 000000050004DB5A0000006B; replay then continued up to 000000050004DB5A0000007E and stopped because there are no more WALs in timeline 5.
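The switch point can be sanity-checked from the names alone: the two segments sit at consecutive WAL positions, only the timeline field differs. A small bash sketch (the `wal_pos` helper is hypothetical; it flattens log and segment into a single position, assuming the default 16 MiB segments, i.e. 256 segments per log):

```shell
# bash: flatten a WAL segment name into a linear position (log*256 + seg),
# ignoring the timeline field. Assumes default 16 MiB segments.
wal_pos() {
  printf '%d\n' $(( 16#${1:8:8} * 256 + 16#${1:16:8} ))
}

a=$(wal_pos 000000040004DB5A0000006A)
b=$(wal_pos 000000050004DB5A0000006B)
echo $(( b - a ))   # -> 1
```

A difference of 1 confirms this is a clean timeline switch at a segment boundary, not a gap in the archive.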
GCS contains both 000000040004DB5A0000006B (archived at May 16, 2021, 6:22:33 PM) and 000000050004DB5A0000006B (May 16, 2021, 6:29:12 PM). There is also a 000000040004DB5A0000006B.partial in GCS (May 16, 2021, 6:29:09 PM).
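Whichever machine performed the switch also had to archive a timeline history file, 00000005.history, recording the parent timeline, the switch LSN, and the reason; fetching and inspecting it should identify the switch point precisely. A sketch with a hypothetical sample file (the contents below are illustrative, inferred from the segment names; the real file in GCS may differ):

```shell
# bash: a timeline history file has one line per switch:
#   <parent TLI>  <switch LSN>  <reason>
# Hypothetical contents consistent with a switch at segment ...6B
# (log 0x4DB5A, seg 0x6B => LSN 4DB5A/6B000000):
cat > /tmp/00000005.history <<'EOF'
4 4DB5A/6B000000 no recovery target specified
EOF

# The last line names the parent timeline and the switch LSN:
awk 'END { print $1, $2 }' /tmp/00000005.history   # -> 4 4DB5A/6B000000
```

Comparing that LSN and the file's archive timestamp against Patroni/Postgres logs on each candidate node should narrow down the writer.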
The main question is: which machine wrote the wrong WALs to GCS?
@ahmadsherif – this might be related to the risks I explained in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13231, or to some other recent activity involving additionally created Postgres nodes.