pgwatch-prometheus sink wipes metric cache after every scrape: long-poll metrics single-shot, lost on any scrape miss (#198)


## Problem

`internal/sinks/prometheus.go` in upstream pgwatch v3.6.0+ wipes the per-DB metric cache at the end of every `/pgwatch` `Collect()`:

```go
promAsyncMetricCacheLock.Lock()
promAsyncMetricCache[dbname] = make(map[string]metrics.MeasurementEnvelope) // clear the cache for this db after metrics are collected
promAsyncMetricCacheLock.Unlock()
```

Introduced in upstream commit [`fb7abf39`](https://github.com/cybertec-postgresql/pgwatch/commit/fb7abf391bd4aae975ddd3b37b17f1b69b108c90) ([PR #790](https://github.com/cybertec-postgresql/pgwatch/pull/790)), shipped from v3.6.0 (2025-07) onward, so it is present in every postgresai release since 0.14.0 too.

## Why it's a data-loss hazard

Because of the wipe, each sample collected by pgwatch is exposed on `/pgwatch` exactly **once**: on the next scrape after collection. If that one scrape doesn't reach storage (rejection, network blip, VM restart, deploy, scrape timeout, whatever), the sample is gone until the next poll cycle.

For metrics with poll interval >> scrape interval this is brutal:

| Metric | Poll interval | One missed scrape = lost data window |
|---|---|---|
| `pg_table_bloat` | 7200s (2h) | up to ~4h (this poll lost, next is 2h away) |
| `pg_btree_bloat` | 7200s | ~4h |
| `pg_invalid_indexes` | 7200s | ~4h |
| `unused_indexes` | 7200s | ~4h |
| `redundant_indexes` | 10800s (3h) | ~6h |
| `rarely_used_indexes` | 10800s | ~6h |
| `stats_reset` | 3600s (1h) | ~2h |
| `settings` | 300s (5m) | ~10m |
| `multixact_size` | 300s | ~10m |

Without the wipe, each polled sample would stay in cache for the full 10-min `promScrapingStalenessHardDropLimit` window, giving VM up to 20 consecutive 30s-cadence scrapes to capture it instead of 1.

## Concrete impact today

On `denis-postgresai-015-test` (Supabase target, see #195), `sample_limit` is currently rejecting ~97.5% of VM scrapes for unrelated cardinality reasons. The probability that the *one* scrape carrying a 7200s metric's sample lands successfully is therefore ~2.5%.
So **roughly 97.5% of bloat / unused-index measurements on that install are permanently lost**: bloat history is intermittent and dangerously interpolated, because one stale data point gets `last_over_time()`'d across hours of actual change.

Even when #195's sizing is fixed and scrape success rate returns to ~100%, a *single* 30-second hiccup during a deploy still erases ~2 hours of bloat data (one full poll cycle). With the wipe gone, the same 30s hiccup costs nothing: the other 19 of the 20 scrapes in the 10-min staleness window still carry the sample.

## What the fix is

Remove the 3-line wipe from `Collect()`. The cache then holds the latest sample per `(db, metric)` slot until it is overwritten by the next poll, or until the existing 10-min staleness guard drops it.

Samples are emitted with their original collection epoch via `NewMetricWithTimestamp()`, so VM deduplicates repeats: re-emitting the same `(metric, timestamp)` across scrapes is a no-op storage-side. The 10-min staleness guard already covers the "collection stalled, stop emitting" case the wipe was presumably defending against.

## Distinction from #195

The drain mechanism was identified during investigation of #195, but it is **not** the user-visible root cause of #195. #195's symptom (mostly-empty Grafana on Supabase) is caused by the per-scrape sample count exceeding `sample_limit: 10000` on every scrape, which is a cardinality-vs-cap problem separate from this bug.

Fixing the drain alone will not change Grafana on the #195 target; fixing #195's sizing alone will not fix the long-poll data loss described here. They're two issues.

## Refs

- Upstream regression: cybertec-postgresql/pgwatch#790 (shipped v3.6.0+)
- Investigation context: #195
- Fix MR: !266

## Acceptance

1. `/pgwatch` returns the latest sample per `(db, metric)` slot on every scrape (up to the existing 10-min staleness cutoff), not just on the first scrape after each poll.
2. With the fix applied, a single dropped scrape no longer costs an entire poll cycle of data for long-poll metrics. Verified by: poll a long-cycle metric (e.g. `pg_table_bloat`), drop the next scrape (e.g. via firewall / `sample_limit: 1`), confirm the metric is still present on the *following* scrape.
3. Upstream fix submitted to cybertec-postgresql/pgwatch as a separate PR.
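
The behaviour that acceptance check 2 verifies can be sketched with a toy model of the cache. This is not pgwatch code: the `sample` and `sink` types, `store`, `collect`, and the two helper functions are hypothetical names made up for illustration, and the model tracks a single DB only. It assumes the semantics described in this issue: one slot per metric, a 10-min staleness guard, and an optional wipe at the end of each scrape.

```go
package main

import (
	"fmt"
	"time"
)

// sample is a hypothetical stand-in for metrics.MeasurementEnvelope.
type sample struct {
	metric      string
	collectedAt time.Time
}

// sink is a toy model of the per-DB cache (single DB for brevity).
type sink struct {
	cache           map[string]sample // one slot per metric name
	wipeAfterScrape bool              // true models the current v3.6.0+ behaviour
	stalenessLimit  time.Duration     // models the 10-min hard drop limit
}

func newSink(wipe bool) *sink {
	return &sink{
		cache:           map[string]sample{},
		wipeAfterScrape: wipe,
		stalenessLimit:  10 * time.Minute,
	}
}

// store overwrites the metric's slot, as a fresh poll would.
func (s *sink) store(metric string, at time.Time) {
	s.cache[metric] = sample{metric: metric, collectedAt: at}
}

// collect returns what a scrape at time now would expose.
func (s *sink) collect(now time.Time) []sample {
	var out []sample
	for name, smp := range s.cache {
		if now.Sub(smp.collectedAt) > s.stalenessLimit {
			delete(s.cache, name) // staleness guard drops old samples
			continue
		}
		out = append(out, smp)
	}
	if s.wipeAfterScrape {
		s.cache = map[string]sample{} // the wipe this issue proposes removing
	}
	return out
}

// survivesDroppedScrape polls once, loses the first scrape, and reports
// whether the second scrape still carries the sample.
func survivesDroppedScrape(wipe bool) bool {
	t0 := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	s := newSink(wipe)
	s.store("pg_table_bloat", t0)
	_ = s.collect(t0.Add(30 * time.Second)) // scrape served, but storage rejects it
	return len(s.collect(t0.Add(60*time.Second))) == 1
}

// staleSampleDropped reports whether a sample older than the staleness
// limit is withheld even when the wipe is disabled.
func staleSampleDropped() bool {
	t0 := time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)
	s := newSink(false)
	s.store("pg_table_bloat", t0)
	return len(s.collect(t0.Add(11*time.Minute))) == 0
}

func main() {
	fmt.Println("with wipe:", survivesDroppedScrape(true))
	fmt.Println("without wipe:", survivesDroppedScrape(false))
	fmt.Println("stale sample dropped:", staleSampleDropped())
}
```

In this model the sample survives a dropped scrape only when the wipe is disabled, while the staleness guard still stops emission once a sample is older than 10 minutes. A real verification would still have to run against an actual `/pgwatch` endpoint as described in check 2.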
