Fix multiple wal-g backup-push failures
Tonight we had frequent GCS failures, which exposed multiple problems with wal-g backup-push (3 gprd backup runs failed tonight). Reminder: we currently have GCS_CONTEXT_TIMEOUT set to 600s, so that wal-push doesn't block for too long on errors.
Backup fails when it can't clean up temporary chunks
Apparently GCS failed just at the moment when part 3179 had been uploaded and composed and wal-g tried to delete the temporary chunks.
<13>Dec 9 00:00:01 backup.sh: INFO: 2020/12/09 02:04:56.856506 Finished writing part 3179.
<13>Dec 9 00:00:01 backup.sh: INFO: 2020/12/09 02:04:56.856548 Starting part 3187 ...
<13>Dec 9 00:00:01 backup.sh: INFO: 2020/12/09 02:04:57.665167 Finished writing part 3180.
<13>Dec 9 00:00:01 backup.sh: INFO: 2020/12/09 02:04:58.793366 Starting part 3188 ...
<13>Dec 9 00:00:01 backup.sh: INFO: 2020/12/09 02:05:03.142654 Finished writing part 3181.
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:09.417098 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 0
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:09.431631 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 1
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:09.612694 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 2
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:09.999934 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 3
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:10.667517 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 4
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:12.002879 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 5
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:15.451436 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 6
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:20.455462 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 7
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:30.321180 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 8
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:05:52.629282 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 9
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:06:44.111167 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 10
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:08:46.191842 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 11
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:12:35.924007 Failed to run a retriable func. Err: storage: object doesn't exist, retrying attempt 12
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:14:29.377187 GCS error : Unable to delete temporary chunks: GCS error : Unable to delete a temporary chunk: context deadline exceeded
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:14:29.377220 upload: could not upload 'base_000000040003836F000000A5/tar_partitions/part_3179.tar.br'
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:14:29.377225 GCS error : Unable to delete temporary chunks: GCS error : Unable to delete a temporary chunk: context deadline exceeded
<13>Dec 9 00:00:01 backup.sh: ERROR: 2020/12/09 02:14:29.377288 Unable to continue the backup process because of the loss of a part 3179.
I think wal-g managed to delete chunks 0-4; GCS then maybe returned a failure for deleting chunk 5 even though chunk 5 actually was deleted, and wal-g kept retrying the delete of chunk 5 - always getting `object doesn't exist` - until the context deadline kicked in. That made the whole backup fail.
I checked: part 3179 actually is there, and so are the chunks that couldn't be deleted:
Henris-MBP:~ hphilipps$ gsutil ls -l gs://gitlab-gprd-postgres-backup/pitr-walg-pg11/basebackups_005/base_000000040003836F000000A5/tar_partitions/part_3179.tar.br
436441173 2020-12-09T02:04:57Z gs://gitlab-gprd-postgres-backup/pitr-walg-pg11/basebackups_005/base_000000040003836F000000A5/tar_partitions/part_3179.tar.br
TOTAL: 1 objects, 436441173 bytes (416.22 MiB)
Henris-MBP:~ hphilipps$ gsutil ls -l gs://gitlab-gprd-postgres-backup/pitr-walg-pg11/basebackups_005/base_000000040003836F000000A5/tar_partitions/part_3179.tar.br_chunks/
52428800 2020-12-09T02:04:52Z gs://gitlab-gprd-postgres-backup/pitr-walg-pg11/basebackups_005/base_000000040003836F000000A5/tar_partitions/part_3179.tar.br_chunks/chunk6
52428800 2020-12-09T02:04:56Z gs://gitlab-gprd-postgres-backup/pitr-walg-pg11/basebackups_005/base_000000040003836F000000A5/tar_partitions/part_3179.tar.br_chunks/chunk7
17010773 2020-12-09T02:04:57Z gs://gitlab-gprd-postgres-backup/pitr-walg-pg11/basebackups_005/base_000000040003836F000000A5/tar_partitions/part_3179.tar.br_chunks/chunk8
TOTAL: 3 objects, 121868373 bytes (116.22 MiB)
Suggestion:
- When cleaning up, we should not fail on `object doesn't exist` errors - just print a warning (see the sketch below)
- Failing to clean up chunks shouldn't be a fatal error stopping the whole backup
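For illustration, a minimal sketch (not wal-g's actual code) of a cleanup that tolerates already-deleted objects, assuming the standard cloud.google.com/go/storage client - note that the `storage: object doesn't exist` message in the log above is exactly that client's `storage.ErrObjectNotExist` sentinel:

```go
package main

import (
	"context"
	"errors"
	"log"

	"cloud.google.com/go/storage"
)

// deleteChunk removes a temporary chunk but treats "object doesn't exist"
// as success: the chunk may already be gone if an earlier delete succeeded
// on the GCS side even though the response reported a failure.
func deleteChunk(ctx context.Context, obj *storage.ObjectHandle) error {
	err := obj.Delete(ctx)
	if errors.Is(err, storage.ErrObjectNotExist) {
		log.Printf("WARNING: chunk %s already deleted, skipping", obj.ObjectName())
		return nil // not fatal - the desired state (chunk gone) is reached
	}
	return err
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	obj := client.Bucket("gitlab-gprd-postgres-backup").
		Object("pitr-walg-pg11/basebackups_005/base_000000040003836F000000A5/tar_partitions/part_3179.tar.br_chunks/chunk5")
	if err := deleteChunk(ctx, obj); err != nil {
		log.Fatalf("cleanup failed: %v", err)
	}
}
```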
Backup fails because of serious GCS outage and short GCS context timeout
The backup was restarted by the EOC, but then failed on multiple parts at the same time - GCS must have had serious problems. Maybe with a context timeout of more than 10m we would have been able to retry long enough to survive this. So in this case it isn't wal-g's fault:
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:32:43.855286 Starting part 2828 ...
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:34:42.924392 Unable to copy an object chunk base_000000040003838C000000F8/tar_partitions/part_2826.tar.br_chunks/chunk1, part 1, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:34:42.924426 Unable to close object writer base_000000040003838C000000F8/tar_partitions/part_2826.tar.br_chunks/chunk1, part 1, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:34:42.924432 Failed to run a retriable func. Err: googleapi: got HTTP response code 503 with body: , retrying attempt 0
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:35:56.501148 Finished writing part 2820.
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:35:56.501253 Starting part 2829 ...
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:35:59.029180 Finished writing part 2822.
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:35:59.029211 Starting part 2830 ...
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:37:00.796524 Finished writing part 2824.
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:37:00.796608 Starting part 2831 ...
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:37:04.137069 Finished writing part 2823.
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:37:04.137107 Starting part 2832 ...
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:39:54.072958 Unable to copy an object chunk base_000000040003838C000000F8/tar_partitions/part_2828.tar.br_chunks/chunk3, part 3, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:39:54.073007 Unable to close object writer base_000000040003838C000000F8/tar_partitions/part_2828.tar.br_chunks/chunk3, part 3, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:39:54.073010 Failed to run a retriable func. Err: googleapi: got HTTP response code 503 with body: , retrying attempt 0
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:40:44.952144 Unable to copy an object chunk base_000000040003838C000000F8/tar_partitions/part_2830.tar.br_chunks/chunk2, part 2, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:40:44.952176 Unable to close object writer base_000000040003838C000000F8/tar_partitions/part_2830.tar.br_chunks/chunk2, part 2, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:40:44.952185 Failed to run a retriable func. Err: googleapi: got HTTP response code 503 with body: , retrying attempt 0
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:40:54.395088 Unable to copy an object chunk base_000000040003838C000000F8/tar_partitions/part_2829.tar.br_chunks/chunk2, part 2, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:40:54.395122 Unable to close object writer base_000000040003838C000000F8/tar_partitions/part_2829.tar.br_chunks/chunk2, part 2, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:40:54.395128 Failed to run a retriable func. Err: googleapi: got HTTP response code 503 with body: , retrying attempt 0
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:41:14.960228 Unable to copy an object chunk base_000000040003838C000000F8/tar_partitions/part_2832.tar.br_chunks/chunk1, part 1, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:41:14.960264 Unable to close object writer base_000000040003838C000000F8/tar_partitions/part_2832.tar.br_chunks/chunk1, part 1, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:41:14.960268 Failed to run a retriable func. Err: googleapi: got HTTP response code 503 with body: , retrying attempt 0
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:42:12.277519 Finished writing part 2825.
<13>Dec 9 02:39:12 backup.sh: INFO: 2020/12/09 04:42:12.277555 Starting part 2833 ...
<13>Dec 9 02:39:12 backup.sh: WARNING: 2020/12/09 04:42:37.292358 Unable to close object writer base_000000040003838C000000F8/tar_partitions/part_2825.tar.br_chunks/chunk6, part 6, err: Post https://storage.googleapis.com/upload/storage/v1/b/gitlab-gprd-postgres-backup/o?alt=json&name=pitr-walg-pg11%2Fbasebackups_005%2Fbase_000000040003838C000000F8%2Ftar_partitions%2Fpart_2825.tar.br_chunks%2Fchunk6&prettyPrint=false&projection=full&uploadType=multipart: context deadline exceeded
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:42:37.293047 Failed to run a retriable func. Err: context deadline exceeded, retrying attempt 0
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:42:37.322158 GCS error : Unable to compose object: context deadline exceeded
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:42:37.322187 upload: could not upload 'base_000000040003838C000000F8/tar_partitions/part_2825.tar.br'
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:42:37.322191 GCS error : Unable to compose object: context deadline exceeded
<13>Dec 9 02:39:12 backup.sh: ERROR: 2020/12/09 04:42:37.322199 Unable to continue the backup process because of the loss of a part 2825.
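For context on the timeout mechanics: the 600s GCS_CONTEXT_TIMEOUT is one shared budget for a whole retry loop, so however generous the per-attempt backoff is, once the deadline passes every remaining call fails with `context deadline exceeded`. A generic Go sketch of that pattern (`doUpload` is a hypothetical stand-in for any retriable func, not a wal-g function):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// doUpload stands in for any retriable GCS call (upload, compose, delete).
func doUpload(ctx context.Context) error {
	return errors.New("googleapi: got HTTP response code 503")
}

func main() {
	// One 10-minute budget shared by all attempts, like GCS_CONTEXT_TIMEOUT=600s.
	ctx, cancel := context.WithTimeout(context.Background(), 600*time.Second)
	defer cancel()

	backoff := time.Second
	for attempt := 0; ; attempt++ {
		err := doUpload(ctx)
		if err == nil {
			return
		}
		fmt.Printf("retrying attempt %d: %v\n", attempt, err)
		select {
		case <-time.After(backoff):
			backoff *= 2 // the log timestamps above show this kind of exponential backoff
		case <-ctx.Done():
			// After 10 minutes, everything still in flight fails with
			// "context deadline exceeded" - which is what ended these backups.
			fmt.Println(ctx.Err())
			return
		}
	}
}
```

A longer timeout would buy more retry attempts during an extended outage, at the cost of a stuck backup blocking for longer.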
Backup fails when closing the writer of a chunk fails?
This one isn't fully clear to me, as the logs don't say which retriable func (upload, compose, cleanup) failed. But my assumption is this:
Closing the writer for chunk 6 of part 6628 failed, and we don't seem to retry when closing the writer fails during upload. Thus chunk 6 never comes into existence, and when trying to compose, we get 404 errors until we time out:
<13>Dec 9 05:43:15 backup.sh: INFO: 2020/12/09 10:19:03.258079 Finished writing part 6634.
<13>Dec 9 05:43:15 backup.sh: WARNING: 2020/12/09 10:19:27.668491 Unable to close object writer base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6, part 6, err: googleapi: got HTTP response code 503 with body:
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:27.792494 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 0
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:27.938009 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 1
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:28.128904 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 2
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:28.508147 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 3
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:29.531231 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 4
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:31.349780 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 5
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:35.075894 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 6
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:42.193005 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 7
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:19:51.924669 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 8
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:20:15.367526 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 9
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:21:17.551962 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 10
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:23:07.873625 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 11
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:25:30.905058 Failed to run a retriable func. Err: googleapi: Error 404: Object pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br_chunks/chunk6 (generation: 0) not found., notFound, retrying attempt 12
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:28:03.582246 GCS error : Unable to compose object: context deadline exceeded
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:28:03.582279 upload: could not upload 'base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br'
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:28:03.582284 GCS error : Unable to compose object: context deadline exceeded
<13>Dec 9 05:43:15 backup.sh: ERROR: 2020/12/09 10:28:03.582290 Unable to continue the backup process because of the loss of a part 6628.
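To illustrate why the 404s are consistent with a never-created chunk: GCS compose references its source objects by name at request time, so a missing chunk6 produces the same notFound on every attempt. A minimal sketch using the cloud.google.com/go/storage client (not wal-g's code; bucket and object names are taken from the log above, and the chunk count is illustrative):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	bkt := client.Bucket("gitlab-gprd-postgres-backup")
	part := "pitr-walg-pg11/basebackups_005/base_00000004000383BC0000006B/tar_partitions/part_6628.tar.br"

	// Compose the final part object from its chunk objects. If chunk6 was
	// never created because its writer failed to close, Run fails with
	// "googleapi: Error 404 ... chunk6 ... notFound" on every retry.
	var srcs []*storage.ObjectHandle
	for i := 0; i <= 6; i++ {
		srcs = append(srcs, bkt.Object(fmt.Sprintf("%s_chunks/chunk%d", part, i)))
	}
	if _, err := bkt.Object(part).ComposerFrom(srcs...).Run(ctx); err != nil {
		log.Printf("compose failed: %v", err)
	}
}
```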
Suggestion:
- Make sure we retry when failing to close the writer on uploads. Maybe overriding the `err` return var in the defer func doesn't work as expected (see the sketch below)? https://github.com/wal-g/storages/blob/master/gcs/uploader.go#L75
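For reference, the classic Go pitfall I have in mind (a self-contained sketch, not the wal-g code): assigning the error from `Close` inside a `defer` only reaches the caller if the function uses a named return value:

```go
package main

import "fmt"

type failingWriter struct{}

func (w *failingWriter) Close() error {
	return fmt.Errorf("googleapi: got HTTP response code 503")
}

// closeLost: err is a plain local variable, so the deferred assignment runs
// after the (unnamed) return value has already been captured - the Close
// error is silently dropped and nothing would trigger a retry.
func closeLost(w *failingWriter) error {
	var err error
	defer func() {
		err = w.Close()
	}()
	return err
}

// closeKept: with a named return value, the deferred assignment updates
// what the caller actually receives.
func closeKept(w *failingWriter) (err error) {
	defer func() {
		if cerr := w.Close(); cerr != nil && err == nil {
			err = cerr
		}
	}()
	return nil
}

func main() {
	w := &failingWriter{}
	fmt.Println(closeLost(w)) // <nil> - the Close error is lost
	fmt.Println(closeKept(w)) // googleapi: got HTTP response code 503
}
```

If the uploader code effectively behaves like the first variant, a failed Close would look successful to the retry logic, which would explain the missing chunk.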
/cc @NikolayS