Reduce max retries for wal-g wal-push
Purpose
Essentially we are telling wal-g to give up sooner (i.e. after a little over 30 seconds, rather than 10 minutes), and instead letting the retry behavior at the Postgres level take over by spawning a fresh wal-g process.
Background: Why does WAL archiving stall for 10 minutes at a time?
Currently we use wal-g to archive Postgres WAL files to object storage (GCS).
Each time Postgres archives a WAL file, it runs its archive_command, which in turn runs a (normally) short-lived wal-g wal-push process.
Occasionally an upload is doomed to fail every retry attempt because of a problem on the GCS service side. See the discovery notes from @alexander-sosna here: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15362#note_882603775
Example:
ERROR: 2022/03/20 13:36:05.880231 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007798000000080.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 0
ERROR: 2022/03/20 13:36:05.906443 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007798000000080.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 1
ERROR: 2022/03/20 13:36:06.181170 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007798000000080.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 2
...
These retries use an exponential backoff and ultimately abort due to either the GCS context deadline expiring or running out of retry attempts. Currently it takes 10 minutes before giving up.
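To illustrate why capping the retry count shortens the stall so sharply, here is a minimal sketch of a jitter-free exponential backoff. The function names, the 0.5-second base delay, and the doubling factor are all hypothetical choices for illustration; they are not wal-g's actual values or implementation.

```python
def backoff_delays(max_retries, base_delay=0.5, factor=2.0):
    """Return the delay in seconds slept before each retry (no jitter).

    base_delay and factor are illustrative assumptions, not wal-g's settings.
    """
    return [base_delay * factor ** attempt for attempt in range(max_retries)]


def total_wait(max_retries, base_delay=0.5, factor=2.0):
    """Worst-case time spent sleeping if every retry fails."""
    return sum(backoff_delays(max_retries, base_delay, factor))


if __name__ == "__main__":
    # Compare the worst-case wait for a few hypothetical retry caps.
    for retries in (4, 8, 12):
        print(f"max_retries={retries:2d} -> worst-case wait {total_wait(retries):8.1f} s")
```

Under these assumptions each additional retry roughly doubles the worst-case wait, so trimming only the last few attempts removes most of the stall.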
During that 10-minute timespan where wal-g wal-push is effectively stalled, a backlog of WAL files accumulates. Accumulating such a backlog is undesirable for several reasons (for more background, see https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15362#note_877494437).
To avoid accumulating that backlog of WAL files, it is better for the wal-g wal-push process to fail fast when it has transient trouble interacting with GCS.
Rather than having the wal-g process internally retry for many minutes, we would rather have that process fail and let Postgres retry by spawning a fresh wal-g process. The new wal-g process will establish a new GCS context, which has a better chance of succeeding.
In the same thread (https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15362#note_882603775) we discussed a couple of options and settled on tuning GCS_MAX_RETRIES rather than GCS_CONTEXT_TIMEOUT, but both appear to be viable alternatives.
Example
This is the growth pattern we want to avoid:
Whenever wal-g wal-push stalls for 10 minutes, the backlog grows throughout that time, and subsequent wal-g processes have to struggle to catch up quickly.
We aim to reduce the stall duration, so that the backlog does not grow so large.
And for the record, here is the series of wal-g wal-push retry failures corresponding to those steady rises in backlog size:
First rise (19:28 - 19:38 UTC):
ERROR: 2022/03/23 19:28:07.405591 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 0
ERROR: 2022/03/23 19:28:07.427102 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 1
ERROR: 2022/03/23 19:28:07.692228 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 2
ERROR: 2022/03/23 19:28:08.203930 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 3
ERROR: 2022/03/23 19:28:08.968861 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 4
ERROR: 2022/03/23 19:28:10.450047 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 5
ERROR: 2022/03/23 19:28:13.932587 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 6
ERROR: 2022/03/23 19:28:18.317006 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 7
ERROR: 2022/03/23 19:28:27.816777 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 8
ERROR: 2022/03/23 19:28:45.810573 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 9
ERROR: 2022/03/23 19:29:28.466821 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 10
ERROR: 2022/03/23 19:31:07.801014 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 11
ERROR: 2022/03/23 19:35:05.572249 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/000000050007846800000060.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 12
ERROR: 2022/03/23 19:37:59.390423 GCS error : Failed to compose temporary chunks into an object: GCS error : Unable to compose object: context deadline exceeded
ERROR: 2022/03/23 19:37:59.390512 Error of background uploader: upload: could not Upload 'pg_wal/000000050007846800000060'
Second rise (19:44 - 19:54 UTC):
ERROR: 2022/03/23 19:44:45.403147 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 0
ERROR: 2022/03/23 19:44:45.422561 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 1
ERROR: 2022/03/23 19:44:45.690802 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 2
ERROR: 2022/03/23 19:44:46.134981 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 3
ERROR: 2022/03/23 19:44:46.890602 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 4
ERROR: 2022/03/23 19:44:48.376210 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 5
ERROR: 2022/03/23 19:44:51.857314 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 6
ERROR: 2022/03/23 19:44:56.241260 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 7
ERROR: 2022/03/23 19:45:05.756913 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 8
ERROR: 2022/03/23 19:45:23.750269 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 9
ERROR: 2022/03/23 19:46:06.416723 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 10
ERROR: 2022/03/23 19:47:45.777192 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 11
ERROR: 2022/03/23 19:51:43.527474 Failed to run a retryable func. Err: googleapi: Error 404: Object pitr-walg-pg12/wal_005/0000000500078470000000CF.br_chunks/chunk0 (generation: 0) not found., notFound, retrying attempt 12
ERROR: 2022/03/23 19:54:43.470479 GCS error : Failed to compose temporary chunks into an object: GCS error : Unable to compose object: context deadline exceeded
ERROR: 2022/03/23 19:54:43.470669 upload: could not Upload 'pg_wal/0000000500078470000000CF'
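As a rough check on where the time goes, the sketch below computes the elapsed time at a few points in the first rise. The timestamps are copied from the log lines above; the helper names are hypothetical and exist only for this illustration.

```python
from datetime import datetime

# Timestamps copied from the "First rise" log lines above.
LOG_TIMES = {
    "attempt 0": "2022/03/23 19:28:07.405591",
    "attempt 5": "2022/03/23 19:28:10.450047",
    "attempt 9": "2022/03/23 19:28:45.810573",
    "attempt 10": "2022/03/23 19:29:28.466821",
    "attempt 12": "2022/03/23 19:35:05.572249",
    "context deadline exceeded": "2022/03/23 19:37:59.390423",
}


def parse(ts):
    """Parse a wal-g log timestamp such as '2022/03/23 19:28:07.405591'."""
    return datetime.strptime(ts, "%Y/%m/%d %H:%M:%S.%f")


def elapsed_seconds(times):
    """Map each label to the seconds elapsed since 'attempt 0'."""
    start = parse(times["attempt 0"])
    return {label: (parse(ts) - start).total_seconds() for label, ts in times.items()}


ELAPSED = elapsed_seconds(LOG_TIMES)

if __name__ == "__main__":
    for label, secs in ELAPSED.items():
        print(f"{label:>26}: {secs:7.1f} s")
```

In this sample, attempt 9 is logged a little under 40 seconds after the first failure, while the context deadline error arrives just under 10 minutes in, so the waits after attempt 9 account for most of the stall.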