pg_stat_statements metric exports unlimited queryids, causing VictoriaMetrics OOM on high-cardinality databases
## Problem
The `pg_stat_statements` metric in `config/pgwatch-prometheus/metrics.yml` exports **every** queryid from `pg_stat_statements` with no `LIMIT` or top-N filter.
On databases with high query diversity (e.g. managed Postgres services, ORMs generating many unique queries), this can produce 10,000+ unique time series. Each series has ~15 counter columns, so the total label combinations explode.
VictoriaMetrics then fails when evaluating PromQL functions like `irate()` across all these series:
```
cannot execute query: not enough memory for processing
irate(pgwatch_pg_stat_statements_calls...)
which returns 12726613 data points across 11773 time series with 1081 points
total available memory: 241591910 bytes
requested memory: 215398808 bytes
```
This effectively breaks Grafana dashboards that query pg_stat_statements metrics.
Additionally, the Flask backend (`monitoring_flask_backend` / `flask-pgss-api`) also OOMs when loading all queryid rows into Python memory. The container is limited to 192m and gets SIGKILL'd repeatedly:
```
[ERROR] Worker (pid:18133) was sent SIGKILL! Perhaps out of memory?
```
## Root cause
Both the PG 11 and PG 17 SQL variants in the `pg_stat_statements` metric select from `aggregated_statements` without any row limit:
```sql
select
datname as tag_datname,
queryid as tag_queryid,
calls::int8 as calls,
...
from aggregated_statements
```
No `ORDER BY` + `LIMIT`, no top-N filter. A Supabase database was observed with 4,910 unique queryids — typical databases have 100-500.
## Proposed fix
1. **Limit the Prometheus-sink metric** to the top N queryids (e.g. top 200 by total exec time):
```sql
select
datname as tag_datname,
queryid as tag_queryid,
calls::int8 as calls,
...
from aggregated_statements
order by exec_time_total desc
limit 200
```
This keeps cardinality bounded regardless of how many unique queries the database has. The PostgreSQL sink (`pgwatch-postgres`) can continue to store all queryids since it doesn't have the same cardinality constraints.
2. **Increase `monitoring_flask_backend` mem_limit** from 192m to 512m in docker-compose.yml as a safety net for the Flask backend which also processes pg_stat_statements data.
## Notes
- The VictoriaMetrics `mem_limit` is 1536m in docker-compose.yml. Increasing it is a band-aid, not a fix — cardinality will keep growing.
- The `pgss_queryid_queries` metric in `pgwatch-postgres/metrics.yml` (which stores query text) already operates on a separate sink and is not affected.
- This is specifically a Prometheus-sink cardinality problem. The PostgreSQL sink handles high cardinality fine.
---
## Acceptance Criteria
- `pg_stat_statements` Prometheus-sink metric limited to top N queryids (e.g., top 200 by total exec time) using `ORDER BY exec_time_total DESC LIMIT 200`
- VictoriaMetrics no longer OOMs when processing `irate()` over pg_stat_statements series
- Grafana dashboards querying pg_stat_statements metrics load successfully without memory errors
- `monitoring_flask_backend` memory limit increased from 192m to 512m in docker-compose.yml as safety net
- PostgreSQL sink (`pgwatch-postgres`) continues to store all queryids without any limit
- The top-N filter applies to both PG 11 and PG 17 SQL variants in the metric definition
## Definition of Done
- Metric SQL updated in `config/pgwatch-prometheus/metrics.yml` with row limit
- Flask backend memory limit updated in docker-compose.yml
- Tested on a high-cardinality database (4000+ unique queryids) confirming VictoriaMetrics stability
- Grafana pg_stat_statements panels render without OOM or 500 errors
- No loss of important query metrics (top queries by execution time are always captured)
issue