fix(pgwatch): keep Prometheus sink cache across scrapes

Summary

Patch upstream pgwatch v3.7.0 to remove the per-DB cache wipe at the end of every Prometheus scrape. The wipe was introduced upstream in commit fb7abf39 (v3.6.0, 2025-07, PR #790) and has shipped in every postgresai release that uses upstream pgwatch ≥ v3.6.0.

The wipe means each polled sample is exposed on /pgwatch exactly once: on the first scrape after it is collected, since the wipe runs at the end of that scrape. If that one scrape fails to land in VictoriaMetrics (VM) for any reason, the sample is lost for good and the metric has no data until the next poll cycle produces a new sample.

For metrics polled at intervals far longer than the 30s scrape interval, this is a real data-loss bug, not a cosmetic one.

Why it matters — data loss, not aesthetics

Metric                 Poll interval   One missed scrape = lost data window
pg_table_bloat         7200s (2h)      up to ~4h
pg_btree_bloat         7200s           ~4h
pg_invalid_indexes     7200s           ~4h
unused_indexes         7200s           ~4h
redundant_indexes      10800s (3h)     ~6h
rarely_used_indexes    10800s          ~6h
stats_reset            3600s (1h)      ~2h
settings               300s (5m)       ~10m
multixact_size         300s            ~10m

Things that will, in normal operation, cost a single scrape and therefore a full poll cycle's worth of these metrics:

  • VM restart or deploy of the monitoring stack
  • Network blip between VM and pgwatch
  • pgwatch HTTP timeout on a slow scrape (default scrape_timeout 25s)
  • Any other scrape that gets rejected — e.g. on a managed-Postgres-shaped target where total cardinality exceeds sample_limit, almost every scrape is rejected (see #195 (closed)) and these long-poll samples are lost ~97.5% of the time

With the wipe gone, each polled sample stays in cache for the full 10-min promScrapingStalenessHardDropLimit — up to 20 consecutive 30s-cadence scrapes get a chance to capture it. A single 30s hiccup costs zero data for these metrics instead of 2-6 hours.
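The figures above are simple arithmetic. A minimal sketch, assuming the 600s staleness limit and 30s scrape interval quoted in this description, and assuming (as the table implies) that one lost sample leaves a gap of two poll intervals. The function names are illustrative only and do not exist in pgwatch.

```shell
#!/bin/sh
# Illustrative helpers; the names are hypothetical, the values come from
# this description.

# Number of scrapes that fall inside the staleness window.
scrape_chances() {   # $1 = staleness limit (s), $2 = scrape interval (s)
    echo $(( $1 / $2 ))
}

# Gap between stored samples when exactly one polled sample is lost
# (assumption: the gap is two poll intervals, as the table implies).
gap_after_one_loss() {   # $1 = poll interval (s)
    echo $(( 2 * $1 ))
}

echo "scrape chances per sample: $(scrape_chances 600 30)"
for poll in 300 3600 7200 10800; do
    echo "poll=${poll}s gap_after_one_loss=$(gap_after_one_loss "$poll")s"
done
```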

What this changes

pgwatch/Dockerfile gains a second sed-based patch (alongside the existing extension-parsing patch) that removes the 3-line wipe block from internal/sinks/prometheus.go and renames the now-unused dbname loop variable to _ to satisfy Go's strict unused-variable check. The patch is defensive — greps for the targets before sed'ing and verifies they're gone afterward, so any upstream restructuring fails the build loudly rather than silently producing an unpatched binary.
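The grep-before, sed, grep-after pattern described above can be sketched roughly as follows. This is not the actual Dockerfile content: the marker string and scratch file are hypothetical stand-ins for the real lines in internal/sinks/prometheus.go, and patch_out is an illustrative name.

```shell
#!/bin/sh
# Sketch only: the marker string and scratch file are hypothetical stand-ins,
# not the real contents of internal/sinks/prometheus.go or pgwatch/Dockerfile.

# Delete every line matching a pattern, failing loudly if the pattern is
# missing beforehand or still present afterwards.
# The pattern is used as a basic regex and must not contain "/".
patch_out() {   # $1 = file, $2 = pattern
    if ! grep -q -- "$2" "$1"; then
        echo "patch target '$2' not found in $1" >&2
        return 1
    fi
    sed "/$2/d" "$1" > "$1.patched" || return 1
    mv "$1.patched" "$1" || return 1
    if grep -q -- "$2" "$1"; then
        echo "patch target '$2' still present in $1" >&2
        return 1
    fi
}

# Demo on a scratch file.
demo=$(mktemp)
printf 'line A\nWIPE_MARKER\nline B\n' > "$demo"
patch_out "$demo" 'WIPE_MARKER' && echo "patched once"
patch_out "$demo" 'WIPE_MARKER' || echo "second run fails as intended"
rm -f "$demo"
```

Run inside a Dockerfile RUN step, a non-zero return from a helper like this aborts the image build, which is the "fail loudly" property the patch relies on.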

Behaviorally: after the patch, /pgwatch serves the latest sample per (db, metric) slot on every scrape. Samples are emitted with their original collection epoch via NewMetricWithTimestamp(), so VM deduplicates repeats by timestamp at storage time — re-emitting the same (metric, timestamp) across scrapes is a no-op. The existing 10-min staleness guard in MetricStoreMessageToPromMetrics already covers the "collection stalled, stop emitting" case the wipe was presumably defending against, so removing the wipe doesn't open the door to indefinite stale-data emission.
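To make the behavioral difference concrete, here is a toy model (not pgwatch code) of one sample collected at t=0, scraped every 30s, with a 600s staleness limit. It assumes that storing the same (metric, timestamp) more than once is harmless, as described above, so it only tracks whether the sample was stored at all. The simulate function and its arguments are hypothetical.

```shell
#!/bin/sh
# Toy model, not pgwatch code. Prints 1 if any scrape stored the sample,
# 0 otherwise.
simulate() {   # $1 = "wipe" or "keep", $2 = number of the failed scrape (0 = none)
    cached=1
    stored=0
    n=1
    t=30
    while [ "$t" -le 600 ]; do   # staleness limit: no emission after 600s
        if [ "$cached" -eq 1 ]; then
            # The sample is served on this scrape; it is stored unless
            # this is the scrape that fails.
            if [ "$n" -ne "$2" ]; then
                stored=1
            fi
            # Pre-patch behavior: serving the sample also wipes it.
            if [ "$1" = "wipe" ]; then
                cached=0
            fi
        fi
        n=$((n + 1))
        t=$((t + 30))
    done
    echo "$stored"
}

echo "wipe, first scrape fails: $(simulate wipe 1)"
echo "keep, first scrape fails: $(simulate keep 1)"
```

In this model the "wipe" run loses the sample when the first scrape fails, while the "keep" run stores it on the second scrape.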

What this does NOT do

  • Does not address #195 (closed). That issue's user-visible symptom (mostly-empty Grafana on Supabase) is caused by per-scrape sample count exceeding sample_limit: 10000 — a cardinality-vs-cap problem, separate from this bug. Fixing the drain does not change Grafana behavior on that target. #195 (closed)'s investigation surfaced the drain bug, hence the historical comments there, but the two fixes are independent. See #198 (closed) for the focused drain-fix issue this MR closes.
  • Does not change sample_limit in config/prometheus/prometheus.yml. After this fix, per-scrape sample count goes up on most targets (long-poll metrics that were previously single-shot are now always present). Whether 10000 is the right cap is a separate sizing decision tied to #195 (closed), not gated on this MR.
  • Does not extend !262 (merged)'s top-N + $other$ cap to the remaining uncapped per-relation metrics (table_stats, table_size_detailed, pg_class, pg_total_relation_size, the bloat metrics, unused/redundant/rarely-used indexes). That's the right next step for #195 (closed) but is out of scope here.
  • Does not push the fix upstream. Belongs as a separate PR to cybertec-postgresql/pgwatch; tracked as acceptance #3 on #198 (closed).
  • Does not add an automated test. Verifying the post-fix behavior (long-poll metric still present on the scrape after a deliberate drop) requires harness work that doesn't exist in tests/ yet; tracked as acceptance #2 on #198 (closed).

Test plan

  • Local image build succeeds (docker build -f pgwatch/Dockerfile pgwatch/)
  • Patched binary boots (docker run --rm postgresai/pgwatch:0.15.0-rc.4-drain-fix --help)
  • Patched source verified by inspection inside the builder stage: wipe block removed, loop header rewritten, all other promAsyncMetricCacheLock.Lock/Unlock pairs intact
  • Reviewer: deploy the patched image to a stack with a long-poll metric enabled (e.g. pg_table_bloat). After a poll lands, drop the next scrape (e.g. with sample_limit: 1 temporarily, or iptables block on port 9091). Confirm the bloat sample is still present on the following scrape.
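For the reviewer step, a small helper along these lines could check whether a scrape still carries the long-poll sample. It is a sketch under assumptions: the host, the exact URL, and the metric-name prefix are guesses to be checked against the real /pgwatch output, and has_metric is an illustrative name.

```shell
#!/bin/sh
# Succeeds if the Prometheus text exposition on stdin contains at least one
# sample line whose name starts with the given prefix. HELP/TYPE comment
# lines start with "#", so they never match the anchored pattern.
has_metric() {   # $1 = metric-name prefix
    grep -q -- "^$1"
}

# Hypothetical usage against a live stack (host, URL, and metric prefix are
# assumptions; check the actual names in your /pgwatch output):
#   curl -fsS "http://localhost:9091/pgwatch" | has_metric "pgwatch_pg_table_bloat"

# Offline demo with a made-up exposition snippet.
sample='# TYPE pgwatch_pg_table_bloat_ratio gauge
pgwatch_pg_table_bloat_ratio{dbname="demo"} 1.5 1700000000000'
if printf '%s\n' "$sample" | has_metric "pgwatch_pg_table_bloat"; then
    echo "bloat sample present"
fi
```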

Refs

  • Closes #198 (closed) (drain-fix issue)
  • Related: #195 (closed) (investigation context; not closed by this MR)
  • Upstream regression: cybertec-postgresql/pgwatch#790 (shipped v3.6.0+)

🤖 Generated with Claude Code

Edited by Denis Morozov
