Indexer resilience: ingestion task fails silently + RPC calls have no timeouts (launch-day risk) (#168) · Issues · Plastic Digits / yieldomega

two tightly-coupled indexer resilience gaps surfaced during a code review pass against `390f9be`. both are launch-day failure modes that compound: RPC hang triggers ingestion silent exit, ingestion silent exit means HTTP API serves stale data forever, no path to recover without process restart.

## Gap 1: ingestion task fail-silent

in `main.rs:31-35` the spawned task swallows errors with just `tracing::error!` and exits cleanly:

```
let ingestion_handle = tokio::spawn(async move {
    if let Err(e) = ingestion::run(&ingest_pool, &ingest_config).await {
        tracing::error!(?e, "ingestion failed");
    }
});
```

after exit, the HTTP API keeps responding 200 OK to all `/v1/*` reads, just from frozen state. `ingestion_handle` is not awaited or supervised. `/v1/status` exposes `max_indexed_block` but no client reads it as a liveness check, and there is no `ingestion_alive: bool` or `last_indexed_at_ms` to alert against.

failure modes that trip this:

- transient RPC error during `get_block_by_number` (any `?` bubble)
- decoder bug on a new event variant
- DB connection blip
- worker panic inside `ingestion::run` that escapes the spawn closure

risk profile: sale ends, indexer dies mid-finalization, frontend shows stale podiums, nobody notices for hours.

fix shape: either (a) wrap `ingestion::run` in a backoff-retry loop that re-enters on `Err`, or (b) propagate the `Err` to main and let the process supervisor restart the binary. additionally surface `ingestion_alive: bool` and `last_indexed_at_ms: u64` in `/v1/status` so smoke tests can alert on staleness.

## Gap 2: RPC calls have no per-call or transport-level timeout

`chain_timer.rs:138` and `ingestion.rs:92` both build providers without timeouts:

```
let provider = ProviderBuilder::new().on_http(url);
```

no `.timeout(...)` on the builder, no `tokio::time::timeout(...)` wrapper on any of the downstream calls. default reqwest behavior is no timeout.

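to make the missing guard concrete, here is a minimal std-only sketch of a per-call deadline. it is not the indexer's code: `call_with_deadline` is a hypothetical helper, and a worker thread plus channel stands in for the async RPC future. in the indexer itself the equivalent guard would be `tokio::time::timeout` around each await.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not from the indexer): run `call` on a worker thread
/// and stop waiting after `deadline`.
///
/// Returns `Some(value)` if the call finished in time, `None` if it did not
/// (or if the worker panicked before sending a result).
fn call_with_deadline<T, F>(deadline: Duration, call: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        // If the caller already gave up, the receiver is gone and `send`
        // returns Err; that is expected and safe to ignore.
        let _ = tx.send(call());
    });

    // Wait at most `deadline`. Unlike dropping a timed-out future, the worker
    // thread keeps running until `call` returns; only the caller is unblocked.
    rx.recv_timeout(deadline).ok()
}
```

the point of the sketch is the caller's side: a hung call costs at most `deadline` and comes back as an explicit "no result", which the caller can turn into an error and retry instead of blocking the whole loop.
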
`chain_timer::poll_once` does 9 sequential `eth_call`s per tick (saleStart, deadline, timerCapSec, ended, podium(0..3)); a single hung call freezes the snapshot at the previous value. ingestion `get_block_by_number(next)` with no timeout blocks the entire indexer.

fix shape: wrap each RPC await in `tokio::time::timeout(Duration::from_secs(5), ...)` (or set a transport-level timeout via `reqwest::Client::builder().timeout(...)` and pipe through `ClientBuilder::on_provider`).

also: `poll_once` 9 sequential calls would benefit from a multicall or parallel `try_join!`, but the timeout fix is the priority.

## Why these are coupled

gap 2 alone gives a stalled-but-alive ingestion task, observable through metrics if you look. gap 1 alone gives a clean failure path that should never trigger. together: RPC hangs → first call eventually times out at the OS level (minutes/hours), ingestion `?` bubbles → fail-silent log → indexer is now a stale-data zombie with no signal to operators.

closing either alone halves the risk. closing both makes the indexer self-recovering through normal ops.

## Priority

gap 1 is the single highest-impact launch-day fix: operator visibility into "is ingestion actually running" goes from zero to definitive. gap 2 is one-day defense-in-depth that mitigates the most common cause of gap 1 firing.

## Out of scope

this is distinct from #156 (production config fail-closed), #157 (raw DB error redaction), and the existing #138 children. those covered config + error-message hygiene, not runtime resilience of the ingestion task itself.

cc @PlasticDigits
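
for reference, a minimal std-only sketch of the Gap 1 fix shape: re-enter on `Err` with capped exponential backoff (option a), hand the error back to the caller once retries are exhausted (option b), and a staleness check over the two proposed status fields. everything here is an assumption for illustration: `supervise`, `backoff_delay`, `is_stale`, the 500 ms base and 30 s cap are hypothetical, and the real loop would be async around `ingestion::run` rather than a blocking closure.

```rust
use std::time::Duration;

/// Hypothetical status payload, using the field names proposed for `/v1/status`.
#[derive(Debug)]
struct IngestionStatus {
    ingestion_alive: bool,
    last_indexed_at_ms: u64,
}

/// Exponential backoff: `base * 2^attempt`, capped at `cap`.
/// Saturating arithmetic keeps large attempt counts from overflowing.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Re-enter `run` after every `Err` instead of exiting silently.
/// After `max_retries` retries the last error is returned, so the caller
/// (e.g. `main`) can exit non-zero and let the process supervisor restart.
/// `sleep` is injected so the loop can be exercised without real waiting.
fn supervise<E>(
    mut run: impl FnMut() -> Result<(), E>,
    mut sleep: impl FnMut(Duration),
    max_retries: u32,
) -> Result<(), E> {
    let mut attempt = 0;
    loop {
        match run() {
            Ok(()) => return Ok(()),
            Err(e) if attempt >= max_retries => return Err(e),
            Err(_) => {
                // Assumed defaults: 500 ms base delay, 30 s cap.
                sleep(backoff_delay(
                    attempt,
                    Duration::from_millis(500),
                    Duration::from_secs(30),
                ));
                attempt += 1;
            }
        }
    }
}

/// Smoke-test style check: stale if ingestion reports dead, or if the last
/// indexed timestamp is older than `max_age_ms`.
fn is_stale(status: &IngestionStatus, now_ms: u64, max_age_ms: u64) -> bool {
    !status.ingestion_alive
        || now_ms.saturating_sub(status.last_indexed_at_ms) > max_age_ms
}
```

a real version would also log each failed attempt and flip `ingestion_alive` to false while it is backing off, so the status endpoint reflects the outage instead of hiding it.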
