fix(kinesis): ProvisionedThroughputExceededException retry uses fixed 5 s delay — thundering herd risk
## Summary

MR !943 adds retry logic for `ProvisionedThroughputExceededException` in `logpipe/backend/kinesis.py`, which is a good idea. However, the backoff duration is a **fixed constant** (`time.sleep(5)`). Under Kinesis throttling, all consumer instances reading from the same stream hit the exception at roughly the same time, sleep for exactly 5 seconds, and retry together. This is a classic thundering herd that can turn a brief throttle into sustained overload on the stream.

## Where it happens

`logpipe/backend/kinesis.py` (added by MR !943):

```python
if e.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
    logger.warning("Caught ProvisionedThroughputExceededException. Sleeping for 5 seconds.")
    time.sleep(5)  # fixed, no jitter
```

## Why this matters

When a Kinesis stream is throttled:

1. All consumers reading that stream get `ProvisionedThroughputExceededException` at roughly the same time.
2. With a fixed 5-second sleep, they all wake up within the same narrow window and retry nearly simultaneously.
3. The retry burst can re-trigger the throttle, causing repeated waves of failures.

This pattern is described in the [AWS Architecture Blog: Exponential Backoff and Jitter (Marc Brooker)](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). The recommended fix is **full jitter** or at minimum **equal jitter**.

## Suggested fix

```python
import random

if e.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
    jitter = random.uniform(1.0, 5.0)  # spread retries across [1s, 5s]
    logger.warning(
        "Caught ProvisionedThroughputExceededException. Sleeping for %.1f seconds.",
        jitter,
    )
    time.sleep(jitter)
```

This spreads retries across a [1 s, 5 s] window, so consumers that were throttled together are unlikely to retry at the same moment. It is a minimal change: the delay is randomized but does not grow across repeated attempts, so it is uniform jitter on a fixed window and not the exponential backoff with jitter that the blog post describes.
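For comparison, the full-jitter and equal-jitter strategies named above could look roughly like the sketch below. The helper names, the `attempt` counter, and the `base` and `cap` defaults are illustrative assumptions, not code from MR !943 or from `logpipe`.

```python
import random


def _ceiling(attempt, base, cap):
    """Capped exponential ceiling: min(cap, base * 2**attempt)."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    return min(cap, base * 2 ** min(attempt, 32))


def full_jitter_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Full jitter: sleep a uniform random time in [0, ceiling]."""
    return rng.uniform(0.0, _ceiling(attempt, base, cap))


def equal_jitter_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Equal jitter: sleep ceiling/2 plus a uniform random time in [0, ceiling/2]."""
    half = _ceiling(attempt, base, cap) / 2.0
    return half + rng.uniform(0.0, half)
```

In the exception handler this would replace the fixed sleep with something like `time.sleep(full_jitter_delay(attempt))`, where `attempt` is a zero-based retry counter that the retry loop would need to track. Whether the existing loop exposes such a counter is an assumption here. Full jitter spreads retries the most but allows near-zero delays, while equal jitter guarantees at least half of the current ceiling.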
## How this was found

This issue was identified by [Quorum](https://github.com/KaustubhUp025/quorum), an open-source Gemini-powered agent that reviews merge requests for distributed coordination anti-patterns (thundering herds, missing saga compensation, lost updates, transactional outbox violations, and more). It uses code search to verify findings across the full repository before reporting.

Happy to open a follow-up MR with the jitter fix if that would be helpful.