Per-relation metrics outside !262's scope overrun `sample_limit: 10000`: sparse Grafana on managed-Postgres targets (#195)

## TL;DR

On a stock `0.15.0-rc.4` install against Supabase (Postgres 17.6, default Supabase extensions), the per-scrape sample count from `pgwatch-prometheus`'s `/pgwatch` endpoint sits well above `sample_limit: 10000`. VictoriaMetrics rejects almost every scrape (`12 744` `exceeds sample_limit` log lines by day 5, roughly one per 30s scrape), so dashboards show at most a handful of bars per hour.

The cardinality comes from per-relation metrics outside !262's scope. !262 ported the gen2 "rank-then-cap + `$other$`" pattern to four metrics (`pg_stat_all_tables/indexes`, `pg_statio_all_tables/indexes`), but the remaining per-relation metrics (`table_stats`, `table_size_detailed`, `pg_class`, `pg_total_relation_size`, `pg_table_bloat`, `pg_btree_bloat`, `unused_indexes`, `redundant_indexes`, `rarely_used_indexes`, `pg_invalid_indexes`) are either uncapped or use the older flat-`LIMIT`-no-aggregate-row pattern that !262 was replacing. On a 242-table target the flat LIMITs don't truncate at all, so the full fan-out lands on every scrape. `mon health` reports the install green throughout.

---

## Symptom

Fresh `mon local-install --tag 0.15.0-rc.4` against a stock Supabase project (eu-west-1, Postgres 17.6, default Supabase extensions). All containers are `Up 3+ days`, `mon health` reports green, and pgbench load is running and visible on the source (`SELECT count(*) FROM pg_stat_activity` confirms ~12 sessions for the monitor role).

**Grafana dashboards remain mostly empty**, showing only sparse, isolated bars rather than continuous series. For example, the "Last 1 hour" view of "01. Single node performance overview (high-level)" (`TPS`, `QPS`, `Query total time`, `Active session history` panels) has 2-3 isolated bars and large empty gaps across the hour.

`docker logs sink-prometheus` shows the cause every 30s:

```
2026-05-19T20:32:17.284Z warn cannot scrape target "http://pgwatch-prometheus:9091/pgwatch" ... the response from ... exceeds sample_limit=10000; either reduce the sample count for the target or increase sample_limit
```

The oldest entry is at `2026-05-15T21:11:17`, 11s after VictoriaMetrics first started. There were `11 219` rejections in the first 4 days and `12 744` by day 5, one for nearly every scrape interval. The rare scrapes that do land (2-3 isolated bars per hour, roughly 2.5% of scrapes) are the few that dipped just below 10 000 because a couple of 30s-poll metrics missed their window in that 30s span.

## Root cause: per-relation cardinality outside !262's scope

!262 ported the gen2 "rank-then-cap + `$other$` aggregate row" pattern to four per-relation metrics:

- `pg_stat_all_tables`
- `pg_stat_all_indexes`
- `pg_statio_all_tables`
- `pg_statio_all_indexes`

That cap is working as designed. But several other per-relation metrics, all enabled by the `full` preset, either have no cap or use the older flat-`LIMIT`-no-aggregate-row pattern that !262 was specifically replacing for the four above. On a 242-table target like Supabase, the flat LIMITs don't truncate anything because they are set well above the table count, so the full per-relation fan-out lands on every scrape:

| Metric | Poll | What it has today | Effective samples on 242 tables |
|---|---|---|---|
| `table_stats` | 30s | **nothing** (schema filter only) | ~4 840 (242 × ~20 cols); biggest single contributor |
| `pg_total_relation_size` | 30s | flat `LIMIT 5000`, no aggregate | ~242 |
| `table_size_detailed` | 30s | flat `LIMIT 1000`, no aggregate | ~1 200 |
| `pg_class` | 30s | flat `LIMIT 10000`, no aggregate | ~500–1 700 |
| `pg_table_bloat` | 7200s | flat `LIMIT 1000`, no aggregate | ~2 400 |
| `pg_btree_bloat` | 7200s | (varies; needs a per-metric SQL audit) | ~2 000 |
| `unused_indexes` | 7200s | (varies) | per-index fan-out |
| `redundant_indexes` | 10800s | mixed (`row_number()` ranking present but no `$other$`) | per-index fan-out |
| `rarely_used_indexes` | 10800s | (varies) | per-index fan-out |
| `pg_invalid_indexes` | 7200s | `row_number()` + flat `LIMIT 1000`, no aggregate | per-index fan-out |
| `pg_stat_statements` | 30s | top-100 by exec time (capped, good) | ~1 500 |

Ballpark sum across the full preset on this target: ~15 000–17 500 per-scrape samples. That is well above the 10 000 cap, so most scrapes are rejected.

## Timeline: this is an install-time, all-scrapes failure

`12 744` `exceeds sample_limit` rejections have accumulated continuously since 11s after VictoriaMetrics first started. `pgwatch-prometheus` has had exactly 1 restart in the 5-day window (2026-05-19 14:18 UTC, unrelated to scrape state). There has been no point in this instance's lifetime where scrapes were not being routinely rejected. The "sparse, isolated bars" on dashboards have always been the rare scrapes that dipped just under 10k, which is easily mistaken for "normal but quiet."

## What this issue is NOT

A separate concern surfaced during this investigation: the Prometheus sink wipes its per-DB cache after every scrape, which causes long-poll metrics (bloat, unused_indexes, etc.) to become single-shot and vulnerable to permanent loss on any scrape miss. That's a **real but separate** bug, tracked in #198 with its own fix MR (!266).

It is **not** the cause of this issue's user-visible symptom. In normal VM-only 30s operation, the cache has the full 30s to fill before each scrape, so each scrape is already at near-steady-state cardinality. The empty-scrape behaviour I previously reported was an artifact of probing at 1-2s intervals, faster than the polls could refill the cache. See [this comment](https://gitlab.com/postgres-ai/postgresai/-/work_items/195#note_3367854184) for the full retraction.

## Distinction from #190 / !262

This is **not** the same as #190 (closed by !262, "port pgwatch postgresai edition's top-N + 'other' bucket to pg_stat/statio_all_*"). I verified !262 is shipped in rc.4: the four metrics it targets carry the `$other$` aggregate row and the gen2 rank-then-cap pattern.
This issue is the **follow-up** !262 left implicit: extend the same pattern to the per-relation metrics !262 didn't touch.

## Reproduction

1. `postgresai mon local-install --tag 0.15.0-rc.4 --db-url <supabase-pooler-url>` against a stock Supabase project (free tier is sufficient: 242 tables in non-system schemas is enough cardinality).
2. Generate any sustained load on the source (e.g., `pgbench -n -c 5 -j 2 -T 600 -f read_only_script.sql <conn>`).
3. Open Grafana → "01. Single node performance overview (high-level)" → "Last 1 hour". Expect: continuous bars for TPS/QPS/active sessions. Actual: 2-3 isolated bars, large empty gaps.
4. `docker logs sink-prometheus 2>&1 | grep -c "exceeds sample_limit"`. Expect: 0. Actual: one log line every 30s, present from ~11s after the stack first came up.
5. `docker exec sink-prometheus wget -qO- http://pgwatch-prometheus:9091/pgwatch | grep -v '^#' | wc -l` from inside the docker network. This confirms each scrape's payload is ~12k+ samples in steady state.

## Why `mon health` doesn't catch this

`mon health` reports the install green throughout, both at install time and 5 days later, despite ~12 744 rejected scrapes. The dashboards always have *something* on them (the rare landing scrapes), so a casual look reads as "monitoring is up, just quiet"; only a deliberate load test against the source DB reveals that the dashboard's quietness doesn't track the source's activity.

`mon health` should be querying VictoriaMetrics' `/api/v1/targets` and failing on any target with `health != "up"`. The rejected-scrape state would be visible there and caught at install time. (This is a separate hardening item that deserves its own issue; flagging it here for tracking.)
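
A minimal sketch of the `/api/v1/targets` check proposed above, in Python for illustration only; it is not how `mon health` is implemented today. It assumes the response follows the Prometheus-compatible targets shape (`data.activeTargets[]` with `scrapeUrl`, `health`, `lastError`). The URL constant and the function names are hypothetical, and the address depends on how the stack publishes the VictoriaMetrics port.

```python
import json
from urllib.request import urlopen

# Assumption: VictoriaMetrics is reachable here from wherever the check runs.
VM_TARGETS_URL = "http://localhost:8428/api/v1/targets"  # hypothetical address


def unhealthy_targets(payload: dict) -> list[dict]:
    """Return active scrape targets whose health is anything other than "up"."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "scrapeUrl": target.get("scrapeUrl", ""),
            "health": target.get("health", "unknown"),
            "lastError": target.get("lastError", ""),
        }
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(url: str = VM_TARGETS_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the targets document from the scraper."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def main() -> int:
    """Print every non-"up" target; return 1 if there is at least one."""
    bad = unhealthy_targets(fetch_targets())
    for target in bad:
        print(f"{target['scrapeUrl']}: {target['health']} ({target['lastError']})")
    return 1 if bad else 0
```

A health command could call `main()` and use the return value as its exit code. Whether to fail immediately or only after more than one scrape interval is a policy choice left open here.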
## Environment

- Container images: `postgresai/pgwatch:0.15.0-rc.4`, `postgresai/monitoring-flask-backend:0.15.0-rc.4`, `postgresai/reporter:0.15.0-rc.4`, `victoriametrics/victoria-metrics:v1.140.0`
- Source DB: Supabase free tier, `aws-0-eu-west-1.pooler.supabase.com:5432/postgres`, Postgres 17.6 on aarch64, 23 default Supabase extensions
- Source DB cardinality: 76 user tables, 242 all-schema tables (after `pg_toast`/`pg_catalog`/`information_schema` exclusions in queries that have them), 859 `pg_class` entries total
- `prometheus.yml`: `scrape_interval: 30s`, `sample_limit: 10000`, `scrape_timeout: 25s` (unchanged from shipped defaults)

## Refs

- Closes part of follow-up from #190 ("pgwatch prometheus sink resilience to oversized scrapes")
- Related: #188 (Supabase dogfooding bugs umbrella)
- Related: #194 (broader Supabase monitoring gaps)
- Related: #143 (pg_stat_statements unlimited queryids; capped via top-100, working)
- Separate bug: #198 (drain wipe: long-poll metrics single-shot), MR !266

## Acceptance

1. Per-scrape sample count from `/pgwatch` does not exceed `sample_limit: 10000` on a stock Supabase install (or whatever the cap is set to after this work). Verified by `docker exec sink-prometheus wget -qO- http://pgwatch-prometheus:9091/pgwatch | grep -v '^#' | wc -l` returning a value comfortably under the cap on every scrape.
2. The remaining per-relation metrics (`table_stats`, `table_size_detailed`, `pg_class`, `pg_total_relation_size`, `pg_table_bloat`, `pg_btree_bloat`, `unused_indexes`, `redundant_indexes`, `rarely_used_indexes`, `pg_invalid_indexes`) carry the gen2 "rank-then-cap + `$other$` aggregate row" pattern that !262 already ships for the `pg_stat_all_*` / `pg_statio_all_*` family. Same SQL shape, same `$other$` sentinel, same `HAVING count(*) > 0` suppression.
3. Compliance-vector tests pin the pattern on every per-relation metric (extends the tests !262 added).
4. Integration test executes each rewritten SQL against a real PostgreSQL and verifies (a) capped row count, (b) `$other$` row present when truncation kicks in, suppressed otherwise.
5. (Hardening, optional) `mon health` fails install if any VM scrape target has `health != "up"` for more than one scrape interval. Belongs in its own issue if scope here is too big.
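
To make checks (a) and (b) in item 4 concrete, here is a small Python reference model of the "rank-then-cap + `$other$`" behaviour. It is a sketch under assumptions, not the SQL that !262 ships: the function name, the `relname` label and the choice to sum the folded rows are all hypothetical, and the real aggregate may differ per column.

```python
OTHER = "$other$"  # sentinel label named in this issue


def rank_then_cap(rows, top_n, rank_key):
    """Keep the top_n rows by rank_key; fold the remainder into one $other$ row.

    rows: list of dicts holding a 'relname' label plus numeric metric columns.
    Assumption: folded rows are summed column by column.
    """
    ranked = sorted(rows, key=lambda r: r[rank_key], reverse=True)
    kept, rest = ranked[:top_n], ranked[top_n:]
    if not rest:
        # Nothing was truncated, so no aggregate row is emitted
        # (models the HAVING count(*) > 0 suppression).
        return kept
    other = {"relname": OTHER}
    for col in rest[0]:
        if col != "relname":
            other[col] = sum(r[col] for r in rest)
    return kept + [other]


# Five relations capped at three: three ranked rows plus one aggregate row.
tables = [
    {"relname": f"t{i}", "seq_scan": i * 10, "n_live_tup": i}
    for i in range(1, 6)
]
capped = rank_then_cap(tables, top_n=3, rank_key="seq_scan")
uncapped = rank_then_cap(tables, top_n=10, rank_key="seq_scan")
```

With this model the output never exceeds `top_n + 1` rows per metric, which is the property that bounds per-scrape cardinality regardless of how many relations the target has.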
