S3 retry delayer shares one backoff across all requests
The S3 driver builds one backoff object and reuses it for every request and every retry, for the whole life of the client (internal/storage/driver/s3/requestcontrol.go, in newCustomDelayer). It is never reset and ignores the SDK's per-attempt counter.
@vespian_gl (S06 DRI), requesting your call. This was ported from Container Registry s3-aws/v2, so you have the original context.
Is the current code broken?
Two real issues, both unrelated to the backoff version:
- Data race.
cenkalti/backoffis not thread-safe, but one instance is shared across concurrent S3 requests. Verified:go test -raceflags a race on the backoff'scurrentInterval(read inNextBackOff, write inincrementCurrentInterval), reached throughBackoffDelay. It reproduces on v6 too, so the bump does not fix it. - Backoff is not per-request. Because the object is shared and never reset, its interval only grows and then sticks at the 60s cap. A new request's first retry can wait close to 60s instead of starting near 500ms. The code says it should "behave like its predecessor", but the predecessor (CR v1
s3wrapper.go) used a fresh backoff per request.
What changes if we move to v6 (!598 (closed))?
- It does not compile until we drop one line: v6 removed the
MaxElapsedTimefield. - Dropping it fixes a v4 bug: today, after 15 minutes of uptime the backoff returns -1 forever, which the SDK reads as "no wait", so retries lose their delay (only the rate limiter still paces them). On v6 retries keep a real, bounded delay.
- It does not fix the two issues above. Those come from the shared object, not the version.
- The delay math is otherwise identical.
Decision needed
The clean fix is a stateless delayer that computes the delay from the attempt number (this is how the AWS SDK's own backoff works). It fixes both issues and is safe under concurrency. It also removes the need for cenkalti/backoff, and this file is the only thing in AR that uses it.
Pick one:
- Stateless delayer, drop
cenkalti/backoff. Fixes everything. Closes !598 (closed) (we leave v4 by removing it). - Stateless delayer, keep
cenkalti/backofffor its default numbers. Fixes everything, then land the v6 bump.
If you decide the race and the per-request behavior are low enough priority to defer, that's a conscious wontfix on this issue, and the v6 bump can land on its own to fix the v4 -1 bug in the meantime.
Links
internal/storage/driver/s3/requestcontrol.go- CR
s3-aws/v2/requestcontrol.go(source),s3-aws/v1/s3wrapper.go(per-request predecessor) - aws-sdk-go-v2
aws/retry/jitter_backoff.go(stateless example) - !598 (closed)