Unexpected writers to the WAL archive – something wrote WALs for timeline 5 while gprd is still on 4

This issue describes a potential danger to WAL-G archives – another machine wrote WALs with a new timeline, 5, while the primary is still on timeline 4.

The gprd WAL-G archive contains WALs for timeline 00000005:

The gprd primary is still on timeline 4:

patroni-v12-02-db-gprd.c.gitlab-production.internal:~$ sudo ls -la /var/opt/gitlab/postgresql/data12/pg_wal | tail -5
-rw-------  1 gitlab-psql gitlab-psql 16777216 May 18 08:07 000000040004E1E100000028
-rw-------  1 gitlab-psql gitlab-psql 16777216 May 18 08:07 000000040004E1E100000029
-rw-------  1 gitlab-psql gitlab-psql      140 May  8 14:46 00000004.history
drwx------  2 gitlab-psql gitlab-psql  6602752 May 18 08:09 archive_status
drwx------  3 gitlab-psql gitlab-psql     4096 May  8 11:13 .wal-g
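
The primary's timeline can be double-checked beyond the pg_wal listing – a quick sketch, assuming the standard Omnibus binary path and the gitlab-psql wrapper are present on these nodes:

# read the control file directly (the binary path here is an assumption):
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_controldata /var/opt/gitlab/postgresql/data12 | grep TimeLineID
# or via SQL; pg_control_checkpoint() is available since PostgreSQL 9.6:
sudo gitlab-psql -c "SELECT timeline_id FROM pg_control_checkpoint();"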

Manually checking:

# wal-g wal-fetch 000000050004DB5A0000007E /tmp/000000050004DB5A0000007E
# ls -la /tmp/000000050004DB5A0000007E
-rw-r--r-- 1 root root 16777216 May 18 08:08 /tmp/000000050004DB5A0000007E
# wal-g wal-fetch 000000050004DB5A0000007F /tmp/000000050004DB5A0000007F
ERROR: 2021/05/18 08:12:13.536432 Archive '000000050004DB5A0000007F' does not exist.

-- it looks like 000000050004DB5A0000007E is the last WAL in the phantom timeline 5.
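
If the installed wal-g is recent enough, the archive's own view of timelines can be enumerated directly – a sketch, assuming the deployed version includes the wal-show and wal-verify subcommands (I haven't checked which version is installed here):

# prints per-timeline segment ranges, so the unexpected 5.x range (and where it ends) stands out:
wal-g wal-show
# scans the archive for missing segments and timeline anomalies:
wal-g wal-verify integrity timeline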

Database Lab consumed these WALs, switched to timeline 5 following the "phantom" source, replayed a few WALs, and then got stuck, not progressing anymore:

2021-05-17 13:27:33.616 UTC,,,12,,60a172c6.c,293496,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000040004DB5A00000067"" from archive",,,,,,,,,""
2021-05-17 13:27:34.203 UTC,,,12,,60a172c6.c,293497,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000040004DB5A00000068"" from archive",,,,,,,,,""
2021-05-17 13:27:34.276 UTC,,,12,,60a172c6.c,293498,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000040004DB5A00000069"" from archive",,,,,,,,,""
2021-05-17 13:27:34.367 UTC,,,12,,60a172c6.c,293499,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000040004DB5A0000006A"" from archive",,,,,,,,,""
2021-05-17 13:27:35.043 UTC,,,12,,60a172c6.c,293500,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006B"" from archive",,,,,,,,,""
2021-05-17 13:27:36.159 UTC,,,12,,60a172c6.c,293501,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006C"" from archive",,,,,,,,,""
2021-05-17 13:27:36.214 UTC,,,12,,60a172c6.c,293502,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006D"" from archive",,,,,,,,,""
2021-05-17 13:27:36.269 UTC,,,12,,60a172c6.c,293503,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006E"" from archive",,,,,,,,,""
2021-05-17 13:27:36.326 UTC,,,12,,60a172c6.c,293504,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000006F"" from archive",,,,,,,,,""
2021-05-17 13:27:36.384 UTC,,,12,,60a172c6.c,293505,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000070"" from archive",,,,,,,,,""
2021-05-17 13:27:36.443 UTC,,,12,,60a172c6.c,293506,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000071"" from archive",,,,,,,,,""
2021-05-17 13:27:36.498 UTC,,,12,,60a172c6.c,293507,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000072"" from archive",,,,,,,,,""
2021-05-17 13:27:36.566 UTC,,,12,,60a172c6.c,293508,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000073"" from archive",,,,,,,,,""
2021-05-17 13:27:36.627 UTC,,,12,,60a172c6.c,293509,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000074"" from archive",,,,,,,,,""
2021-05-17 13:27:36.693 UTC,,,12,,60a172c6.c,293510,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000075"" from archive",,,,,,,,,""
2021-05-17 13:27:36.750 UTC,,,12,,60a172c6.c,293511,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000076"" from archive",,,,,,,,,""
2021-05-17 13:27:36.808 UTC,,,12,,60a172c6.c,293512,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000077"" from archive",,,,,,,,,""
2021-05-17 13:27:36.885 UTC,,,12,,60a172c6.c,293513,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000078"" from archive",,,,,,,,,""
2021-05-17 13:27:36.954 UTC,,,12,,60a172c6.c,293514,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A00000079"" from archive",,,,,,,,,""
2021-05-17 13:27:37.021 UTC,,,12,,60a172c6.c,293515,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007A"" from archive",,,,,,,,,""
2021-05-17 13:27:37.092 UTC,,,12,,60a172c6.c,293516,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007B"" from archive",,,,,,,,,""
2021-05-17 13:27:37.469 UTC,,,12,,60a172c6.c,293517,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007C"" from archive",,,,,,,,,""
2021-05-17 13:27:37.853 UTC,,,12,,60a172c6.c,293518,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007D"" from archive",,,,,,,,,""
2021-05-17 13:27:38.422 UTC,,,12,,60a172c6.c,293519,,2021-05-16 19:30:14 UTC,1/0,0,LOG,00000,"restored log file ""000000050004DB5A0000007E"" from archive",,,,,,,,,""

-- @craig-gomes @abrandl I'm thinking of ways to fix it; most likely we'll need to perform backup-fetch again (while also considering other options, like reverting back in time using ZFS snapshots and staying on timeline 4).
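
If we do revert with ZFS snapshots and stay on timeline 4, the replica would also need to be pinned to that timeline: PostgreSQL 12 defaults recovery_target_timeline to 'latest', which is exactly why recovery jumped onto the phantom timeline. A minimal sketch of the relevant standby settings (PG 12 layout):

# postgresql.auto.conf (or postgresql.conf) on the standby:
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_timeline = '4'   # pin instead of the 'latest' default; ignore the phantom timeline 5
# plus an empty standby.signal file in the data directory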

It looks like this "phantom" switch between timelines 4 and 5 happened at these segments: 000000040004DB5A0000006A -> 000000050004DB5A0000006B; replay then continued up to 000000050004DB5A0000007E and stopped because there are no more WALs on timeline 5.
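
The fork point should also be recorded in the timeline history file in the archive (wal-fetch has to serve .history files too, since restore_command relies on that); each line of the file lists the parent timeline, the fork LSN, and a reason:

wal-g wal-fetch 00000005.history /tmp/00000005.history
cat /tmp/00000005.history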

GCS contains both 000000040004DB5A0000006B (archived at May 16, 2021, 6:22:33 PM) and 000000050004DB5A0000006B (May 16, 2021, 6:29:12 PM). There is also 000000040004DB5A0000006B.partial in GCS (May 16, 2021, 6:29:09 PM).

The main question is: which machine wrote the wrong WALs to GCS?
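
A starting point for attribution – a sketch with a hypothetical bucket path, since I don't have the real one at hand. Object metadata in GCS gives creation times, but identifying the actual writer would require GCS Data Access audit logs for the bucket:

# creation time and metadata of the suspicious segment (path and compression extension are illustrative):
gsutil stat gs://<backup-bucket>/<prefix>/wal_005/000000050004DB5A0000006B.lz4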

@ahmadsherif – this might be related to the risks I explained in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13231, or to some other activity involving the additional Postgres nodes created recently.
