S3 retry delayer shares one backoff across all requests

The S3 driver builds one backoff object and reuses it for every request and every retry, for the whole life of the client (internal/storage/driver/s3/requestcontrol.go, in newCustomDelayer). It is never reset and ignores the SDK's per-attempt counter.

@vespian_gl (S06 DRI), requesting your call. This was ported from Container Registry s3-aws/v2, so you have the original context.

Is the current code broken?

Two real issues, both unrelated to the backoff version:

  1. Data race. cenkalti/backoff is not thread-safe, but one instance is shared across concurrent S3 requests. Verified: go test -race flags a race on the backoff's currentInterval (read in NextBackOff, write in incrementCurrentInterval), reached through BackoffDelay. It reproduces on v6 too, so the bump does not fix it.
  2. Backoff is not per-request. Because the object is shared and never reset, its interval only grows and then sticks at the 60s cap. A new request's first retry can wait close to 60s instead of starting near 500ms. The code says it should "behave like its predecessor", but the predecessor (CR v1 s3wrapper.go) used a fresh backoff per request.

What changes if we move to v6 (!598 (closed))?

  • It does not compile until we drop one line: v6 removed the MaxElapsedTime field.
  • Dropping it fixes a v4 bug: today, after 15 minutes of uptime the backoff returns -1 forever, which the SDK reads as "no wait", so retries lose their delay (only the rate limiter still paces them). On v6 retries keep a real, bounded delay.
  • It does not fix the two issues above. Those come from the shared object, not the version.
  • The delay math is otherwise identical.

Decision needed

The clean fix is a stateless delayer that computes the delay from the attempt number (this is how the AWS SDK's own backoff works). It fixes both issues and is safe under concurrency. It also removes the need for cenkalti/backoff, and this file is the only thing in AR that uses it.

Pick one:

  1. Stateless delayer, drop cenkalti/backoff. Fixes everything. Closes !598 (closed) (we leave v4 by removing it).
  2. Stateless delayer, keep cenkalti/backoff for its default numbers. Fixes everything, then land the v6 bump.

If you decide the race and the per-request behavior are low enough priority to defer, that's a conscious wontfix on this issue, and the v6 bump can land on its own to fix the v4 -1 bug in the meantime.

  • internal/storage/driver/s3/requestcontrol.go
  • CR s3-aws/v2/requestcontrol.go (source), s3-aws/v1/s3wrapper.go (per-request predecessor)
  • aws-sdk-go-v2 aws/retry/jitter_backoff.go (stateless example)
  • !598 (closed)
Edited by João Pereira