Indexer resilience: ingestion task fails silently + RPC calls have no timeouts (launch-day risk)
two tightly-coupled indexer resilience gaps surfaced during a code review pass against `390f9be`. both are launch-day failure modes that compound — RPC hang triggers ingestion silent exit, ingestion silent exit means HTTP API serves stale data forever, no path to recover without process restart.
## Gap 1 — ingestion task fail-silent in `main.rs:31-35`
spawned task swallows errors with just `tracing::error!` and exits cleanly:
```
let ingestion_handle = tokio::spawn(async move {
    if let Err(e) = ingestion::run(&ingest_pool, &ingest_config).await {
        tracing::error!(?e, "ingestion failed");
    }
});
```
after exit, the HTTP API keeps responding 200 OK to all `/v1/*` reads, just from frozen state. `ingestion_handle` is not awaited or supervised. `/v1/status` exposes `max_indexed_block` but no client reads it as a liveness check, and there is no `ingestion_alive: bool` or `last_indexed_at_ms` to alert against.
failure modes that trip this:
- transient RPC error during `get_block_by_number` (any `?` bubble)
- decoder bug on a new event variant
- DB connection blip
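- panic inside `ingestion::run`: tokio catches it and reports it only as a `JoinError` on `ingestion_handle`, which is never awaited, so not even the `tracing::error!` fires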
risk profile: sale ends, indexer dies mid-finalization, frontend shows stale podiums, nobody notices for hours.
fix shape: either (a) wrap ingestion::run in a backoff-retry loop that re-enters on Err, or (b) propagate the Err to main and let the process supervisor restart the binary. additionally surface `ingestion_alive: bool` and `last_indexed_at_ms: u64` in `/v1/status` so smoke tests can alert on staleness.
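a minimal sketch of option (a) plus the two status fields, kept std-only and synchronous so it can be read in isolation. `IngestionStatus`, `backoff_delay`, and `supervise` are hypothetical names, not existing indexer code, and the 500 ms base / 30 s cap are placeholder values. in the real binary the loop would presumably be async, wrapping `ingestion::run(...).await` and sleeping with `tokio::time::sleep`.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::time::Duration;

/// Hypothetical shared state that a `/v1/status` handler could read.
#[derive(Default)]
pub struct IngestionStatus {
    pub ingestion_alive: AtomicBool,
    pub last_indexed_at_ms: AtomicU64,
}

impl IngestionStatus {
    /// True when ingestion is down, or nothing was indexed within `max_age_ms`.
    pub fn is_stale(&self, now_ms: u64, max_age_ms: u64) -> bool {
        let last = self.last_indexed_at_ms.load(Ordering::Relaxed);
        let alive = self.ingestion_alive.load(Ordering::Relaxed);
        !alive || now_ms.saturating_sub(last) > max_age_ms
    }
}

/// Exponential backoff: `base * 2^attempt`, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Re-enters `run` after each Err instead of exiting silently (option (a)).
/// After `max_attempts` failures the last error is returned to the caller,
/// which can then exit and let the process supervisor restart (option (b)).
/// `sleep` is injected so the loop can be exercised without real delays.
pub fn supervise<E>(
    status: &IngestionStatus,
    max_attempts: u32,
    mut run: impl FnMut() -> Result<(), E>,
    mut sleep: impl FnMut(Duration),
) -> Result<(), E> {
    let mut attempt = 0u32;
    loop {
        status.ingestion_alive.store(true, Ordering::Relaxed);
        let result = run();
        // Whatever the outcome, ingestion is not running once `run` returns.
        status.ingestion_alive.store(false, Ordering::Relaxed);
        match result {
            Ok(()) => return Ok(()),
            Err(e) => {
                attempt += 1;
                if attempt >= max_attempts {
                    return Err(e);
                }
                // Placeholder values: 500 ms base, 30 s cap.
                let base = Duration::from_millis(500);
                let cap = Duration::from_secs(30);
                sleep(backoff_delay(attempt - 1, base, cap));
            }
        }
    }
}
```

a smoke test would then alert when `is_stale(now_ms, threshold)` is true. this sketch never resets `attempt`; a production loop would likely reset it once ingestion makes progress again, so a long-running indexer does not burn its retry budget on unrelated blips.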
## Gap 2 — RPC calls have no per-call or transport-level timeout
`chain_timer.rs:138` and `ingestion.rs:92` both build providers without timeouts:
```
let provider = ProviderBuilder::new().on_http(url);
```
no `.timeout(...)` on the builder, no `tokio::time::timeout(...)` wrapper on any of the downstream calls. default reqwest behavior is no timeout. `chain_timer::poll_once` does 9 sequential `eth_call`s per tick (saleStart, deadline, timerCapSec, ended, podium(0..3)); a single hung call freezes the snapshot at the previous value. ingestion `get_block_by_number(next)` with no timeout blocks the entire indexer.
fix shape: wrap each RPC await in `tokio::time::timeout(Duration::from_secs(5), ...)`, or set a transport-level timeout via `reqwest::Client::builder().timeout(...)` and build the provider from that client (in alloy this is likely `Http::with_client` plus `ProviderBuilder::on_client`; `ClientBuilder` does not appear to have an `on_provider` method). also: the 9 sequential calls in `poll_once` would benefit from a multicall or parallel `try_join!`, but the timeout fix is the priority.
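to illustrate the per-call shape without pulling in tokio, here is a std-only blocking analog: it swaps `tokio::time::timeout` for a helper thread plus `mpsc::Receiver::recv_timeout`. `CallError` and `call_with_timeout` are hypothetical names. the point is only that a hung call becomes an explicit error after a fixed budget, which the caller can then retry or report.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum CallError<E> {
    /// No result arrived within the budget (or the helper thread died).
    TimedOut,
    /// The call finished but returned its own error.
    Failed(E),
}

/// Runs `call` on a helper thread and waits at most `budget` for its result.
/// On timeout the helper thread is left running, detached; an async runtime
/// would drop the pending future instead.
pub fn call_with_timeout<T, E, F>(budget: Duration, call: F) -> Result<T, CallError<E>>
where
    T: Send + 'static,
    E: Send + 'static,
    F: FnOnce() -> Result<T, E> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore send errors.
        let _ = tx.send(call());
    });
    match rx.recv_timeout(budget) {
        Ok(Ok(value)) => Ok(value),
        Ok(Err(e)) => Err(CallError::Failed(e)),
        // Either the budget elapsed or the sender was dropped without sending.
        Err(_) => Err(CallError::TimedOut),
    }
}
```

note the arithmetic for `poll_once`: with a 5-second budget per call and 9 sequential calls, a tick is still bounded only at 45 seconds in the worst case, which is the argument for the multicall or `try_join!` follow-up.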
## Why these are coupled
gap 2 alone gives a stalled-but-alive ingestion task — observable through metrics if you look. gap 1 alone gives a clean failure path that should never trigger. together: RPC hangs → first call eventually times out at the OS level (minutes/hours), ingestion `?` bubbles → fail-silent log → indexer is now a stale-data zombie with no signal to operators.
closing either alone halves the risk. closing both makes the indexer self-recovering through normal ops.
## Priority
gap 1 is the single highest-impact launch-day fix — operator visibility into "is ingestion actually running" goes from zero to definitive. gap 2 is one-day defense-in-depth that mitigates the most common cause of gap 1 firing.
## Out of scope
this is distinct from #156 (production config fail-closed), #157 (raw DB error redaction), and the existing #138 children — those covered config + error-message hygiene, not runtime resilience of the ingestion task itself.
cc @PlasticDigits