fix: pgwatch extension parsing, pg_stat_file permission, monitoring stack reliability (!219) · Merge requests · PostgresAI / postgresai

Summary

Fixes for monitoring managed PostgreSQL databases (Supabase, etc.) and monitoring stack reliability issues discovered in production.

1. pgwatch extension version parsing kills all metrics (Closes #141 (closed))

pgwatch's FetchRuntimeInfo scans pg_extension and parses versions with a regex that extracts only major.minor. For extensions like supabase-dbdev (version 0.0.4), this yields "0.0" → VersionToInt returns 0 → pgwatch treats it as invalid and aborts all metric gathering for the source.

Fix: Build pgwatch v3.7.0 from source with a one-line patch (pgwatch/Dockerfile) that skips unparseable extension versions instead of returning a fatal error.

2. multixact_size metric fails with permission denied (Closes #142 (closed))

The multixact_size metric checks if pg_stat_file() exists but not whether the monitoring user has EXECUTE privilege. On managed databases, the function exists but is restricted to superusers.

Fix: Add has_function_privilege('pg_stat_file(text,boolean)', 'execute') to the can_local CTE so the query falls through to the "not available" fallback instead of erroring.

3. self-cadvisor stays dead after crash/reboot (Relates to #148)

self-cadvisor was the only long-running service missing restart: unless-stopped. After a crash or host reboot, it stayed dead permanently — no container metrics in the self-monitoring dashboard.

Fix: Add restart: unless-stopped to the self-cadvisor service in docker-compose.yml.

4. Grafana pgss panels go blank when flask backend is down (Relates to #148)

All pg_stat_statements dashboard panels use a mandatory group_left() join with pgwatch_query_info. When the flask backend is down, the join returns empty results — ALL pgss panels blank even though raw data exists.

Fix: Add or fallback to all pgss queries in Dashboard_2 so panels show data (without display names) when query_info is unavailable.

5. /query_info_metrics times out on high-cardinality databases (Closes #143 (closed), Relates to #130)

PostgREST generates ~7K unique queryids/min. The pgss_queryid_queries table grows unboundedly (520K+ rows/hour), causing the flask /query_info_metrics endpoint to exceed the 30s scrape timeout.

Three-part fix:

Sink dedup trigger: Dropped the old parent unique index and changed enforce_queryid_uniqueness() to serialize by (dbname, queryid) with pg_advisory_xact_lock, UPDATE existing rows' time, data, and tag_data, and only insert when no row exists. This is partition-safe for pgwatch's dbname / time hierarchy and does not need an ON CONFLICT clause or pg_trigger_depth() recursion guard.
Time-filtered export: /query_info_metrics only exports queryids seen in last 10 minutes (QUERYID_ACTIVE_MINUTES env var)
Retention: Deletes entries older than 30 days (QUERYID_RETENTION_HOURS env var, default 720) in batches per scrape — bounded by the upstream calls >= 3 AND exec_time_total >= 1000 + LIMIT 100 filter on the heavy-hitter set

6. pgwatch exports ALL queryids as prometheus metrics — VictoriaMetrics OOM (Closes #143 (closed), Closes #130)

Root cause of the VictoriaMetrics OOM (3.9GB/4GB → killed). The pg_stat_statements metric SQL in pgwatch had no LIMIT — every queryid became a unique time series label. With PostgREST generating thousands of unique queryids, this created 16 million+ time series (1M+ per metric × 10+ pgss metrics), overwhelming VictoriaMetrics.

Fix:

Add ORDER BY exec_time_total DESC LIMIT 100 to both pg_stat_statements SQL variants (PG 11+ and PG 17+) — only the top 100 most time-consuming queries get exported as prometheus metrics
Add sample_limit: 10000 to the pgwatch prometheus scrape config as a safety net — any scrape producing >10000 samples gets rejected

Metric	Before	After
Series per pgss metric	1,000,000+	≤100
Total time series	16,000,000+	~16,000 (normal)
VictoriaMetrics memory	3.9GB/4GB (OOM)	<1GB
Flask /query_info_metrics	>30s (timeout)	<1s
pgss_queryid_queries table	Unbounded (520K+)	Capped by 30-day retention on the heavy-hitter set

Changes

pgwatch/Dockerfile — patched pgwatch build from v3.7.0 source
pgwatch/.dockerignore — keep build context lean
docker-compose.yml — patched pgwatch services + cadvisor restart policy + flask memory increase
config/pgwatch-prometheus/metrics.yml — permission check in multixact_size + LIMIT 100 on pg_stat_statements queries
config/prometheus/prometheus.yml — sample_limit: 10000 safety net for /pgwatch, plus query-info scrape limit notes
config/grafana/dashboards/Dashboard_2_Aggregated_query_analysis.json — resilient pgss queries with fallback arms
config/sink-postgres/init.sql — partition-safe advisory-lock dedup trigger refreshes active queryid rows and drops the incompatible parent unique index
monitoring_flask_backend/app.py — time-filtered metrics export, retention cleanup, lazy first-request trigger verification
monitoring_flask_backend/test_app.py — tests for trigger verification, retention loop, advisory locks, and isolated fixture state

Test plan

https://claude.ai/code/session_01SzJxzZNQjDQphaHyaX3RU7

Edited Apr 24, 2026 by Nikolay Samokhvalov

fix: pgwatch extension parsing, pg_stat_file permission, monitoring stack reliability