fix: pgwatch extension parsing, pg_stat_file permission, monitoring stack reliability

Summary

Fixes for monitoring managed PostgreSQL databases (Supabase, etc.) and monitoring stack reliability issues discovered in production.

1. pgwatch extension version parsing kills all metrics (Closes #141 (closed))

pgwatch's FetchRuntimeInfo scans pg_extension and parses versions with a regex that extracts only major.minor. For extensions like supabase-dbdev (version 0.0.4), this yields "0.0"VersionToInt returns 0 → pgwatch treats it as invalid and aborts all metric gathering for the source.

Fix: Build pgwatch v3.7.0 from source with a one-line patch (pgwatch/Dockerfile) that skips unparseable extension versions instead of returning a fatal error.

2. multixact_size metric fails with permission denied (Closes #142 (closed))

The multixact_size metric checks if pg_stat_file() exists but not whether the monitoring user has EXECUTE privilege. On managed databases, the function exists but is restricted to superusers.

Fix: Add has_function_privilege('pg_stat_file(text,boolean)', 'execute') to the can_local CTE so the query falls through to the "not available" fallback instead of erroring.

3. self-cadvisor stays dead after crash/reboot (Relates to #148)

self-cadvisor was the only long-running service missing restart: unless-stopped. After a crash or host reboot, it stayed dead permanently — no container metrics in the self-monitoring dashboard.

Fix: Add restart: unless-stopped to the self-cadvisor service in docker-compose.yml.

4. Grafana pgss panels go blank when flask backend is down (Relates to #148)

All pg_stat_statements dashboard panels use a mandatory group_left() join with pgwatch_query_info. When the flask backend is down, the join returns empty results — ALL pgss panels blank even though raw data exists.

Fix: Add or fallback to all pgss queries in Dashboard_2 so panels show data (without display names) when query_info is unavailable.

5. /query_info_metrics times out on high-cardinality databases (Closes #143 (closed), Relates to #130)

PostgREST generates ~7K unique queryids/min. The pgss_queryid_queries table grows unboundedly (520K+ rows/hour), causing the flask /query_info_metrics endpoint to exceed the 30s scrape timeout.

Three-part fix:

  • Sink dedup trigger: Dropped the old parent unique index and changed enforce_queryid_uniqueness() to serialize by (dbname, queryid) with pg_advisory_xact_lock, UPDATE existing rows' time, data, and tag_data, and only insert when no row exists. This is partition-safe for pgwatch's dbname / time hierarchy and does not need an ON CONFLICT clause or pg_trigger_depth() recursion guard.
  • Time-filtered export: /query_info_metrics only exports queryids seen in last 10 minutes (QUERYID_ACTIVE_MINUTES env var)
  • Retention: Deletes entries older than 30 days (QUERYID_RETENTION_HOURS env var, default 720) in batches per scrape — bounded by the upstream calls >= 3 AND exec_time_total >= 1000 + LIMIT 100 filter on the heavy-hitter set

6. pgwatch exports ALL queryids as prometheus metrics — VictoriaMetrics OOM (Closes #143 (closed), Closes #130)

Root cause of the VictoriaMetrics OOM (3.9GB/4GB → killed). The pg_stat_statements metric SQL in pgwatch had no LIMIT — every queryid became a unique time series label. With PostgREST generating thousands of unique queryids, this created 16 million+ time series (1M+ per metric × 10+ pgss metrics), overwhelming VictoriaMetrics.

Fix:

  • Add ORDER BY exec_time_total DESC LIMIT 100 to both pg_stat_statements SQL variants (PG 11+ and PG 17+) — only the top 100 most time-consuming queries get exported as prometheus metrics
  • Add sample_limit: 10000 to the pgwatch prometheus scrape config as a safety net — any scrape producing >10000 samples gets rejected
Metric Before After
Series per pgss metric 1,000,000+ ≤100
Total time series 16,000,000+ ~16,000 (normal)
VictoriaMetrics memory 3.9GB/4GB (OOM) <1GB
Flask /query_info_metrics >30s (timeout) <1s
pgss_queryid_queries table Unbounded (520K+) Capped by 30-day retention on the heavy-hitter set

Changes

  • pgwatch/Dockerfile — patched pgwatch build from v3.7.0 source
  • pgwatch/.dockerignore — keep build context lean
  • docker-compose.yml — patched pgwatch services + cadvisor restart policy + flask memory increase
  • config/pgwatch-prometheus/metrics.yml — permission check in multixact_size + LIMIT 100 on pg_stat_statements queries
  • config/prometheus/prometheus.ymlsample_limit: 10000 safety net for /pgwatch, plus query-info scrape limit notes
  • config/grafana/dashboards/Dashboard_2_Aggregated_query_analysis.json — resilient pgss queries with fallback arms
  • config/sink-postgres/init.sql — partition-safe advisory-lock dedup trigger refreshes active queryid rows and drops the incompatible parent unique index
  • monitoring_flask_backend/app.py — time-filtered metrics export, retention cleanup, lazy first-request trigger verification
  • monitoring_flask_backend/test_app.py — tests for trigger verification, retention loop, advisory locks, and isolated fixture state

Test plan

  • Build patched pgwatch image: docker build -t postgresai/pgwatch:3-patched ./pgwatch
  • Deploy to a monitoring instance targeting a Supabase/PostgREST database
  • Verify no unexpected extension errors in pgwatch logs
  • Verify no permission denied for function pg_stat_file errors
  • Verify pgwatch exports ≤100 queryids per scrape (curl pgwatch:9091/pgwatch | grep -c pg_stat_statements_calls)
  • Verify VictoriaMetrics total series stays in normal range (<50K)
  • Verify /query_info_metrics responds in <5s
  • Verify pgss_queryid_queries table stays bounded (retention cleanup working)
  • Verify Dashboard 2 shows query display names
  • Stop flask-pgss-api — verify pgss panels still show data (without display names)
  • Restart host — verify self-cadvisor comes back automatically

https://claude.ai/code/session_01SzJxzZNQjDQphaHyaX3RU7

Edited by Nikolay Samokhvalov

Merge request reports

Loading