pgwatch full preset: per-relation metric SQLs scan system & extension schemas, blowing prometheus sample_limit on extension-heavy DBs (#190) · Issues · PostgresAI / postgresai

pgwatch full preset: per-relation metric SQLs scan system & extension schemas, blowing prometheus sample_limit on extension-heavy DBs

## Summary Nine SQL queries in `config/pgwatch-prometheus/metrics.yml`'s `full` preset iterate over every relation in `pg_class` / `pg_stat_all_*` / `pg_statio_all_*` **without excluding system or TimescaleDB-internal schemas**. On managed PostgreSQL services that ship many extensions (Supabase, Aiven, RDS with PostGIS, etc.) this inflates per-scrape sample count beyond `sample_limit: 10000`, and VictoriaMetrics drops the entire scrape. Refs #188 (item 4 — surfaced this against an extension-heavy Supabase). This issue is the focused, fix-shaped version of that item. ## Reproduction Stock Supabase project, default Supabase extensions only (23) — no PostGIS-tiger-style stress. After `postgresai mon local-install --tag 0.15.0-rc.3 --db-url <supabase-pooler-url>` against the project, VM target page: ``` pgwatch-prometheus down the response from "http://pgwatch-prometheus:9091/pgwatch" exceeds sample_limit=10000 ``` Probing `/pgwatch` repeatedly (bursts every ~30s): | probe | samples | |------:|--------:| | 1 | 2308 | | 2-5 | 3 | | 6 | **12 481** | | 7-10 | 3 | `mon health` reports green throughout — it doesn't validate VM is actually scraping pgwatch successfully. ## Root cause The burst is dominated by per-relation fan-out into schemas the user doesn't want to monitor. On the repro DB: | object | total | user-visible | |---|---:|---:| | `pg_stat_all_tables` | 242 | 76 | | `pg_stat_all_indexes` | 417 | ~150 | | `pg_class` relations | 382 | ~150 | Of the 242 tables, **98 are `pg_toast`**, **64 are `pg_catalog`**, **4 are `information_schema`** — none of which should appear in monitoring dashboards. With ~22 per-table metrics in `table_stats`, that's ~3 600 wasted samples per burst just from system schemas. The author **did write** the right exclusion pattern — it's present in `table_stats.q_root_part`: ```sql and not n.nspname like any (array[E'pg\\_%', 'information_schema', E'\\_timescaledb%']) ``` But it's applied to the **wrong CTE**. `q_root_part` only feeds the partition-root aggregation branch. The main `q_tstats` CTE (which produces the per-table rows) has no such filter, so every row from `pg_stat_all_tables` leaks through. The same omission exists in seven other queries: 1. `table_stats` (PG11 + PG16 variants) — filter only on `q_root_part`, missing from `q_tstats` 2. `pg_class` — `where nspname not in ('information_schema', 'pg_catalog')` misses `pg_toast`, `pg_temp_*`, `_timescaledb_*` 3. `pg_stat_all_indexes` — no filter at all 4. `pg_stat_all_tables` — no filter at all 5. `pg_total_relation_size` — no filter at all 6. `table_size_detailed` — `not in ('information_schema', 'pg_toast')` misses `pg_catalog`, `_timescaledb_*` 7. `pg_statio_all_tables` — no filter at all 8. `pg_statio_all_indexes` — no filter at all ## Fix Apply the canonical exclusion pattern in all nine SQLs: ```sql not <schema_column> like any (array[E'pg\\_%', 'information_schema', E'\\_timescaledb%']) ``` This matches `pg_catalog`, `pg_toast`, `pg_temp_*`, `information_schema`, `_timescaledb_*` — and does **not** match user-named schemas like `pgsodium`, `pgaudit`, or extension-installed application schemas (`auth`, `storage`, `tiger`, …) which users do want monitored. Add a regression test in `tests/compliance_vectors/` that asserts the pattern is present in each affected metric's SQL. ## Why not just raise `sample_limit`? Considered and rejected — see the discussion in MR. Short version: it doesn't fix the root cause (cardinality scaling with `pg_class` count, not user-table count), it weakens the safety net for actual cardinality explosions, and it forces customers to store and query ~3× the series indefinitely. ## What this is NOT - Not a Supabase-specific fix. Any managed Postgres with non-trivial extensions hits the same wall (Aiven w/ PostGIS, RDS, self-managed clusters with pg_partman/pgaudit/timescaledb internal schemas). - Not the answer to #188 item 4's other half — the broken-pipe / partial-scrape behaviour of pgwatch's prometheus sink under oversized responses is a separate concern, more easily exposed when this fix is in place and `sample_limit` is approached intentionally. - Not a tuning change to `sample_limit` itself — that stays at 10000 (and the existing compliance test in `test_mr219_monitoring_guards.py` keeps enforcing it). ## Acceptance - All nine SQLs include the canonical schema-filter pattern. - New regression test passes (and would fail if the pattern is removed from any of the nine). - Existing `test_mr219_monitoring_guards.py` still passes (sample_limit unchanged). - Validated on a live Supabase-monitored install: VM scrape target `pgwatch-prometheus` becomes `up` and stays up; `count(pgwatch_table_stats_table_size_b)` equals the target's `pg_stat_user_tables` count. ## Follow-ups (separate issues) - `mon health` should query `/api/v1/targets` on VictoriaMetrics and fail on any target whose `health != "up"`. As shipped, it greenlights an install where half the pipeline is broken — this issue would be caught at install time if it did. - pgwatch prometheus sink resilience to oversized scrapes / mid-stream disconnects (the broken-pipe part of #188 item 4).

issue