fix(metrics): extend !262 top-N + $other$ to remaining per-relation metrics (!267) · Merge requests · PostgresAI / postgresai

Summary

Port the gen2 "rank-then-cap + '$other$' aggregate row" pattern that !262 (merged) ships for the pg_stat_all_* / pg_statio_all_* family to the remaining per-relation metrics that were left out of !262 (merged)'s scope. On managed-Postgres targets with non-trivial table counts (e.g. Supabase free tier with 242 tables in non-system schemas) these were the dominant per-scrape sample-count contributors and were driving the ~one-per-30s exceeds sample_limit=10000 rejections documented in #195 (closed).

What this changes

config/pgwatch-prometheus/metrics.yml (production metrics) and tests/compliance_vectors/test_mr262_pgwatch_topn_integration.py (integration coverage extended to the new metrics) — no Go code, no Dockerfile, no other configs.

Metrics ported (10 metric blocks, 11 SQL variants):

Metric	Poll	Was	Now	Effective on 242-table Supabase
`table_stats` (PG11)	30s	`ORDER BY size LIMIT 300`, no aggregate	Top-100 + `'$other$'` ranked by `total_relation_size_b`	~4 840 → ~2 000 + other (largest single saving)
`table_stats` (PG16)	30s	same as PG11 + inner LIMIT 1500 pre-filter	same as PG11 + inner LIMIT 1500 retained (perf optimisation)	~4 840 → ~2 000 + other
`table_size_detailed`	30s	flat `LIMIT 1000`, no aggregate	Top-100 + `'$other$'` ranked by `total_relation_size_b`	~1 200 → ~500 + other
`pg_class`	30s	flat `LIMIT 10000`, no aggregate	Top-100 + `'$other$'` ranked by `relpages`	~500-1700 → ~200 + other
`pg_total_relation_size`	30s	flat `LIMIT 5000`, no aggregate	Top-100 + `'$other$'` ranked by total relation size	~242 → 100 + other
`pg_table_bloat`	7200s	flat `LIMIT 1000`, no aggregate	Top-100 + `'$other$'` ranked by `bloat_pct` (with `is_na = 0` preferred)	~2 400 → ~1 000 + other (once polled)
`pg_btree_bloat`	7200s	flat `LIMIT 1000`, no aggregate	Top-100 + `'$other$'` ranked by `bloat_pct`	~2 000 → ~1 000 + other (once polled)
`pg_invalid_indexes`	7200s	`row_number() over ()` (unstable) + flat `LIMIT 1000`	Top-100 + `'$other$'` ranked by `index_size_bytes`	per-DB (typically 0 on healthy clusters)
`redundant_indexes`	10800s	`row_number() over ()` inside CTE + flat `LIMIT 1000`	Top-100 + `'$other$'` ranked by `index_size_bytes` (preserves duplicate-alias output that dashboards rely on)	per-DB
`unused_indexes`	7200s	flat `LIMIT 1000`	Top-100 + `'$other$'` ranked by `index_size_bytes` within the `idx_scan=0 AND idx_is_btree` filter	per-DB
`rarely_used_indexes`	10800s	3-way UNION ALL of reason categories + `ORDER BY grp, size desc LIMIT 1000`	3-way UNION + Top-100 + `'$other$'` (combined; $other$ row's `tag_reason = '$other$'` to distinguish from real reason categories)	per-DB

Pattern (identical shape to !262 (merged)'s four metrics):

WITH ranked AS (
  SELECT row_number() OVER (ORDER BY <relevance> DESC NULLS LAST,
                            <stable tiebreaker(s)>) AS rownum,
         <source columns>
  FROM <source view(s) / sub-CTE>
  [WHERE <existing filters>]
)
SELECT <columns>
FROM ranked
WHERE rownum <= 100
UNION ALL
SELECT '$other$', ..., <aggregates>
FROM ranked
WHERE rownum > 100
GROUP BY ()
HAVING count(*) > 0

Aggregate semantics on the `'$other$'` row

Sum for size and counter columns (*_size_b, *_size_bytes, seq_scan, n_tup_*, vacuum_count, etc.)
Min for "seconds since" timestamps (most recent maintenance across the tail — not 0)
Max for tx_freeze_age / mxid_freeze_age (oldest still-alive xid is the risk)
Weighted-avg for bloat percentages: 100 * sum(greatest(relpages - est_pages, 0)) / sum(relpages) — equivalent to "what fraction of the tail's total page count is wasted"
Avg for fillfactor
Recomputed tag_*_cardinality_mb labels from the summed bytes
'$other$' literal for tag columns that can't aggregate (schema names, index definitions, etc.)
max(is_na) propagates "tail has unreliable rows" upward

The '$other$' sentinel starts with $, which is illegal as the first character of an unquoted Postgres identifier, so it cannot collide with a real schema/table/index name.

Integration test coverage

tests/compliance_vectors/test_mr262_pgwatch_topn_integration.py was extended (originally MR-262 scope, now also covers MR-267):

seeded_cur_mr267 fixture — seeds 110 user tables on a live cluster and adds the postgres_ai.pg_statistic helper view that the bloat metrics depend on.
test_pgwatch_topn_new_metrics_parse_and_cap + test_pgwatch_topn_new_metrics_other_bucket_present — for the 4 FULL_CROSS_CHECK metrics (pg_total_relation_size, pg_class, table_size_detailed, table_stats × PG11/PG16 variants): SQL parses, ≤101 rows, exactly one $other$ row, exactly 100 top-N rows.
test_pgwatch_topn_new_metrics_other_bucket_aggregates_tail — for the same 4 FULL_CROSS_CHECK metrics: top-N + $other$ per-column sum equals the source-view total (catches data loss in the tail aggregation).
test_pgwatch_topn_table_stats_other_bucket_min_max_semantics — runtime cross-check of min()/max() aggregate semantics on the table_stats $other$ row: seconds_since_last_vacuum / _analyze match independent tail min(); tx_freeze_age / mxid_freeze_age match independent tail max(). Catches a min/max swap that would invert "most recently maintained" or understate wraparound risk.
test_pgwatch_topn_new_metrics_syntax_only + test_pgwatch_topn_syntax_only_no_other_when_below_cap — for the 6 SYNTAX_ONLY metrics (bloat + index-list): SQL parses on a seeded cluster, schema-tag/numeric column names present (catches UNION ALL alias swaps), and HAVING count(*) > 0 correctly suppresses the $other$ row when the source set has ≤100 qualifying rows.

The MR-262 metric count assertion (len(METRIC_KEYS) == 4) is unchanged — !267 (merged) metrics are tested via the new parametrize sets above rather than added to that legacy constant.

Empirical validation

Each ported SQL was validated against a live Postgres 17 (target-db on denis-postgresai-015-test) before the test extension was written:

All parse cleanly with psql -v ON_ERROR_STOP=1.
pg_total_relation_size: 68 rows (under cap, no $other$ row emitted — HAVING count(*) > 0 correctly suppresses).
With cap deliberately tightened to 10 for validation, the $other$ row's coalesce(sum(bytes), 0) exactly equals the sum of the omitted tail (verified: top-10 sum + other = total sum).
table_stats PG11: 101 rows = 100 top + 1 $other$ .
table_stats PG16: 101 rows.
Bloat metrics use postgres_ai.pg_statistic (a pgwatch helper view not on the stock test target); substituting pg_stats for syntactic validation parses cleanly and returns 0 rows under the existing >1 MiB filter as expected.

End-to-end soak on a freshly-installed rc4 against Supabase (aws-0-eu-west-1.pooler.supabase.com, 242 tables): per-scrape sample count drops from ~12 100 (baseline, ~70% rejection rate) to 7 500–9 000 (1 000–2 500 of headroom under the 10 000 cap), zero exceeds sample_limit rejections over a 5-minute window. See the validation comment thread for full method + per-metric row counts.

What this does NOT do

Does not change sample_limit in config/prometheus/prometheus.yml. Whether the post-fix steady-state on real targets fits comfortably under 10 000 is a sizing question separate from the cap-shape work this MR does.
Does not address the drain bug (#198, MR !266). Those are independent — even after the drain wipe is removed, the per-relation fan-out on managed-Postgres targets would still exceed 10 000 without this MR's caps.

Test plan

Each ported SQL parses against PG17 (psql -v ON_ERROR_STOP=1)
Top-100 + $other$ row count correct on small targets (no $other$ when under cap)
$other$ aggregate sums verified equal to tail sum on pg_total_relation_size (validated with cap=10 against 68-row target)
YAML still parses (python3 -c "import yaml; yaml.safe_load(open(...))") after each commit
Integration test file extended to cover the 10 new metrics (FULL_CROSS_CHECK, SYNTAX_ONLY, tail-sum cross-check, HAVING-guard, and min/max semantics for table_stats)
Full test file (37 tests) passes against a live Postgres 18 cluster
End-to-end soak on rc4 against 242-table Supabase: per-scrape sample count drops from ~12 100 to 7 500–9 000, zero exceeds sample_limit rejections over 5+ min

Refs

Closes #195 (closed)
Builds on !262 (merged) (which introduced this pattern for pg_stat_all_* / pg_statio_all_*)
Independent of #198 (closed) / !266 (merged) (drain-wipe bug — orthogonal)

🤖 Generated with Claude Code

Edited May 20, 2026 by Denis Morozov

fix(metrics): extend !262 top-N + $other$ to remaining per-relation metrics