fix: pgwatch extension parsing, pg_stat_file permission, monitoring stack reliability
Summary
Fixes for monitoring managed PostgreSQL databases (Supabase, etc.) and monitoring stack reliability issues discovered in production.
1. pgwatch extension version parsing kills all metrics (Closes #141 (closed))
pgwatch's FetchRuntimeInfo scans pg_extension and parses versions with a regex that extracts only major.minor. For extensions like supabase-dbdev (version 0.0.4), this yields "0.0" → VersionToInt returns 0 → pgwatch treats it as invalid and aborts all metric gathering for the source.
Fix: Build pgwatch v3.7.0 from source with a one-line patch (pgwatch/Dockerfile) that skips unparseable extension versions instead of returning a fatal error.
2. multixact_size metric fails with permission denied (Closes #142 (closed))
The multixact_size metric checks if pg_stat_file() exists but not whether the monitoring user has EXECUTE privilege. On managed databases, the function exists but is restricted to superusers.
Fix: Add has_function_privilege('pg_stat_file(text,boolean)', 'execute') to the can_local CTE so the query falls through to the "not available" fallback instead of erroring.
3. self-cadvisor stays dead after crash/reboot (Relates to #148)
self-cadvisor was the only long-running service missing restart: unless-stopped. After a crash or host reboot, it stayed dead permanently — no container metrics in the self-monitoring dashboard.
Fix: Add restart: unless-stopped to the self-cadvisor service in docker-compose.yml.
4. Grafana pgss panels go blank when flask backend is down (Relates to #148)
All pg_stat_statements dashboard panels use a mandatory group_left() join with pgwatch_query_info. When the flask backend is down, the join returns empty results — ALL pgss panels blank even though raw data exists.
Fix: Add or fallback to all pgss queries in Dashboard_2 so panels show data (without display names) when query_info is unavailable.
5. /query_info_metrics times out on high-cardinality databases (Closes #143 (closed), Relates to #130)
PostgREST generates ~7K unique queryids/min. The pgss_queryid_queries table grows unboundedly (520K+ rows/hour), causing the flask /query_info_metrics endpoint to exceed the 30s scrape timeout.
Three-part fix:
- Sink dedup trigger: Dropped the old parent unique index and changed
enforce_queryid_uniqueness()to serialize by(dbname, queryid)withpg_advisory_xact_lock,UPDATEexisting rows'time,data, andtag_data, and only insert when no row exists. This is partition-safe for pgwatch'sdbname/timehierarchy and does not need anON CONFLICTclause orpg_trigger_depth()recursion guard. - Time-filtered export:
/query_info_metricsonly exports queryids seen in last 10 minutes (QUERYID_ACTIVE_MINUTESenv var) - Retention: Deletes entries older than 30 days (
QUERYID_RETENTION_HOURSenv var, default720) in batches per scrape — bounded by the upstreamcalls >= 3 AND exec_time_total >= 1000+LIMIT 100filter on the heavy-hitter set
6. pgwatch exports ALL queryids as prometheus metrics — VictoriaMetrics OOM (Closes #143 (closed), Closes #130)
Root cause of the VictoriaMetrics OOM (3.9GB/4GB → killed). The pg_stat_statements metric SQL in pgwatch had no LIMIT — every queryid became a unique time series label. With PostgREST generating thousands of unique queryids, this created 16 million+ time series (1M+ per metric × 10+ pgss metrics), overwhelming VictoriaMetrics.
Fix:
- Add
ORDER BY exec_time_total DESC LIMIT 100to both pg_stat_statements SQL variants (PG 11+ and PG 17+) — only the top 100 most time-consuming queries get exported as prometheus metrics - Add
sample_limit: 10000to the pgwatch prometheus scrape config as a safety net — any scrape producing >10000 samples gets rejected
| Metric | Before | After |
|---|---|---|
| Series per pgss metric | 1,000,000+ | ≤100 |
| Total time series | 16,000,000+ | ~16,000 (normal) |
| VictoriaMetrics memory | 3.9GB/4GB (OOM) | <1GB |
| Flask /query_info_metrics | >30s (timeout) | <1s |
| pgss_queryid_queries table | Unbounded (520K+) | Capped by 30-day retention on the heavy-hitter set |
Changes
pgwatch/Dockerfile— patched pgwatch build from v3.7.0 sourcepgwatch/.dockerignore— keep build context leandocker-compose.yml— patched pgwatch services + cadvisor restart policy + flask memory increaseconfig/pgwatch-prometheus/metrics.yml— permission check in multixact_size + LIMIT 100 on pg_stat_statements queriesconfig/prometheus/prometheus.yml— sample_limit: 10000 safety net for/pgwatch, plus query-info scrape limit notesconfig/grafana/dashboards/Dashboard_2_Aggregated_query_analysis.json— resilient pgss queries with fallback armsconfig/sink-postgres/init.sql— partition-safe advisory-lock dedup trigger refreshes active queryid rows and drops the incompatible parent unique indexmonitoring_flask_backend/app.py— time-filtered metrics export, retention cleanup, lazy first-request trigger verificationmonitoring_flask_backend/test_app.py— tests for trigger verification, retention loop, advisory locks, and isolated fixture state
Test plan
- Build patched pgwatch image:
docker build -t postgresai/pgwatch:3-patched ./pgwatch - Deploy to a monitoring instance targeting a Supabase/PostgREST database
- Verify no
unexpected extensionerrors in pgwatch logs - Verify no
permission denied for function pg_stat_fileerrors - Verify pgwatch exports ≤100 queryids per scrape (
curl pgwatch:9091/pgwatch | grep -c pg_stat_statements_calls) - Verify VictoriaMetrics total series stays in normal range (<50K)
- Verify
/query_info_metricsresponds in <5s - Verify
pgss_queryid_queriestable stays bounded (retention cleanup working) - Verify Dashboard 2 shows query display names
- Stop
flask-pgss-api— verify pgss panels still show data (without display names) - Restart host — verify
self-cadvisorcomes back automatically