fix(kinesis): ProvisionedThroughputExceededException retry uses fixed 5 s delay — thundering herd risk
## Summary
MR !943 adds retry logic for `ProvisionedThroughputExceededException` in `logpipe/backend/kinesis.py`, which is a good idea. However, the backoff duration is a **fixed constant** (`time.sleep(5)`). Under Kinesis throttling, consumer instances reading from the same stream tend to hit the exception at about the same time, sleep for exactly 5 seconds, and retry together. This is a classic thundering herd, and it can turn a brief throttle into sustained overload on the stream.
## Where it happens
`logpipe/backend/kinesis.py` (added by MR !943):
```python
if e.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
    logger.warning("Caught ProvisionedThroughputExceededException. Sleeping for 5 seconds.")
    time.sleep(5)  # fixed delay, no jitter
```
## Why this matters
When a Kinesis stream is throttled:
1. All consumers reading that stream get `ProvisionedThroughputExceededException` at roughly the same time.
2. With a fixed 5-second sleep, they all wake up at exactly the same moment and retry simultaneously.
3. The retry burst can re-trigger the throttle, causing repeated waves of failures.
This pattern is described in the [AWS Architecture Blog — Exponential Backoff and Jitter (Marc Brooker)](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). The recommended fix is **full jitter** or at minimum **equal jitter**.
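For reference, the two strategies can be sketched as small delay functions. This is a minimal illustration, not code from MR !943: the function names, the `base` and `cap` parameters, and their default values are all assumptions chosen for the example.

```python
import random


def _ceiling(attempt, base, cap):
    # Exponential ceiling base * 2**attempt, capped at `cap` seconds.
    # The exponent is clamped so very large attempt counts cannot overflow.
    return min(cap, base * (2 ** min(attempt, 32)))


def full_jitter_delay(attempt, base=1.0, cap=30.0):
    """Full jitter: a uniform delay in [0, ceiling]."""
    return random.uniform(0.0, _ceiling(attempt, base, cap))


def equal_jitter_delay(attempt, base=1.0, cap=30.0):
    """Equal jitter: half the ceiling is fixed, the other half is random."""
    half = _ceiling(attempt, base, cap) / 2
    return half + random.uniform(0.0, half)
```

Here `attempt` is the zero-based retry count. Full jitter spreads retries over the widest window, while equal jitter guarantees a minimum wait of half the ceiling.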
## Suggested fix
```python
import random  # at module top

if e.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
    jitter = random.uniform(1.0, 5.0)  # spread retries across [1s, 5s]
    logger.warning(
        "Caught ProvisionedThroughputExceededException. Sleeping for %.1f seconds.",
        jitter,
    )
    time.sleep(jitter)
```
This spreads retries uniformly across a [1 s, 5 s] window, which makes it unlikely that many consumers retry at the same instant.
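The effect can be checked with a small simulation. This is only a sketch: the `wake_times` helper and the count of 50 consumers are hypothetical, and it assumes every consumer is throttled at the same moment (t = 0).

```python
import random


def wake_times(n_consumers, delay_fn):
    """Return sorted wake-up offsets (seconds) for consumers throttled at t=0."""
    return sorted(delay_fn() for _ in range(n_consumers))


fixed = wake_times(50, lambda: 5.0)
jittered = wake_times(50, lambda: random.uniform(1.0, 5.0))

print("fixed:    distinct wake-ups =", len(set(fixed)))
print("jittered: distinct wake-ups =", len(set(jittered)))
print("jittered: first %.2f s, last %.2f s" % (jittered[0], jittered[-1]))
```

With the fixed delay, all 50 consumers share a single wake-up time. With jitter, the wake-ups are spread across the window.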
## How this was found
This issue was identified by [Quorum](https://github.com/KaustubhUp025/quorum), an open-source Gemini-powered agent that reviews merge requests for distributed coordination anti-patterns (thundering herds, missing saga compensation, lost updates, transactional outbox violations, and more). It uses code search to verify findings across the full repository before reporting.
Happy to open a follow-up MR with the jitter fix if that would be helpful.