fix: filter pg_stat_statements by calls AND exec_time to cut long tail

Problem

The pg_stat_statements metrics in both config/pgwatch-prometheus/metrics.yml and config/pgwatch-postgres/metrics.yml export every queryid from pg_stat_statements with no filtering.

On databases with high query diversity (e.g., Supabase with inngest trace comments creating unique queryids), this produces hundreds of thousands of unique queryids. Observed on production:

  • 948,916 unique queryids in pgss_queryid_queries table (1.3 GB)
  • Flask backend (/query_info_metrics) OOM crash loop every ~5 minutes trying to process all queryids
  • VictoriaMetrics unable to scrape query info → Dashboard 2 panels empty
  • pgwatch_query_info metric never populated

Fix

Add WHERE calls >= 3 AND total_exec_time >= 1000 filter to both metrics files:

  • Prometheus sink (pgwatch-prometheus/metrics.yml): filters on calls >= 3 AND exec_time_total >= 1000 for both PG 11 and PG 17 SQL variants
  • PostgreSQL sink (pgwatch-postgres/metrics.yml): filters on calls >= 3 AND total_exec_time >= 1000

Using AND (not OR) ensures only queries that are both frequently called AND have significant execution time are exported, cutting the long tail of one-off or trivial queries.

Note: Previous approach used LIMIT 100 which is replaced by this threshold-based filter. The threshold approach is preferred because it adapts to the actual workload rather than imposing an arbitrary cap.

Production validation

Applied on production (34.203.111.251) on 2026-02-16:

  1. Truncated pgss_queryid_queries table (cleared 948K stale rows)
  2. Patched both metrics.yml files
  3. Restarted pgwatch-postgres, pgwatch-prometheus, flask backend
  4. Result: table repopulated with 851 rows (down from 948,916), flask backend stable (no more OOM kills)

Closes #143 (closed)

Edited by Nikolay Samokhvalov

Merge request reports

Loading