fix: filter pg_stat_statements by calls AND exec_time to cut long tail
Problem
The pg_stat_statements metrics in both config/pgwatch-prometheus/metrics.yml and config/pgwatch-postgres/metrics.yml export every queryid from pg_stat_statements with no filtering.
On databases with high query diversity (e.g., Supabase with inngest trace comments creating unique queryids), this produces hundreds of thousands of unique queryids. Observed on production:
-
948,916 unique queryids in
pgss_queryid_queriestable (1.3 GB) - Flask backend (
/query_info_metrics) OOM crash loop every ~5 minutes trying to process all queryids - VictoriaMetrics unable to scrape query info → Dashboard 2 panels empty
-
pgwatch_query_infometric never populated
Fix
Add WHERE calls >= 3 AND total_exec_time >= 1000 filter to both metrics files:
-
Prometheus sink (
pgwatch-prometheus/metrics.yml): filters oncalls >= 3 AND exec_time_total >= 1000for both PG 11 and PG 17 SQL variants -
PostgreSQL sink (
pgwatch-postgres/metrics.yml): filters oncalls >= 3 AND total_exec_time >= 1000
Using AND (not OR) ensures only queries that are both frequently called AND have significant execution time are exported, cutting the long tail of one-off or trivial queries.
Note: Previous approach used LIMIT 100 which is replaced by this threshold-based filter. The threshold approach is preferred because it adapts to the actual workload rather than imposing an arbitrary cap.
Production validation
Applied on production (34.203.111.251) on 2026-02-16:
- Truncated
pgss_queryid_queriestable (cleared 948K stale rows) - Patched both metrics.yml files
- Restarted pgwatch-postgres, pgwatch-prometheus, flask backend
- Result: table repopulated with 851 rows (down from 948,916), flask backend stable (no more OOM kills)