v0.15.0-rc.4 pre-release review: monitoring stack (Dashboard 1, Dashboard 7, metrics.yml)
## Scope
Pre-release review of the **monitoring stack** in `v0.15.0-rc.4` against `v0.14.0-rc.0`. Live Grafana inspected at the internal monitoring instance (Supabase target). Source artifacts reviewed:
- `config/grafana/dashboards/Dashboard_*.json` (especially 01 and 07)
- `config/pgwatch-prometheus/metrics.yml`
- `config/grafana/provisioning/datasources/datasources.yml`
- `postgres_ai_helm/config/grafana/dashboards/` (parity check)
Key feature deltas in this window: Dashboard 7 redesign (`40ca619`, `9dd249e`), xmin horizon monitoring (`3404937`), pg_wal directory size (`6ae6f06`), pg_stat_io / I001 (`4bed739`), datasource re-provisioning fix (`3cb237d`), 0.15 component bumps incl. target-db PG 15 → 17 (`0f6dbe1`).
Findings below are enumerated. Severity tags are suggestions, not gates.
---
## Findings
### 1. **[blocker]** Duplicate panel `id: 50` on Dashboard 1
`config/grafana/dashboards/Dashboard_1_Node_performance_overview.json` contains two panels with `"id": 50`:
- `"xmin horizon age (experimental)"` (added in `3404937` — xmin horizon)
- `"pg_wal directory size"` (added in `6ae6f06` — pg_wal monitoring)
Both features were developed in parallel and each picked `id: 50`. Confirmed live: opening
`/d/.../01-...?viewPanel=50` resolves to the **xmin horizon** panel — the pg_wal panel is unreachable via `viewPanel`/share URLs and breaks any future "Library panel"/permalink usage.
**Fix:** renumber `pg_wal directory size` to the next free id (e.g. 51) in both `config/grafana/dashboards/Dashboard_1_Node_performance_overview.json` and `postgres_ai_helm/config/grafana/dashboards/Dashboard_1_Node_performance_overview.json`. Add a CI/check that fails on duplicate panel ids across each dashboard JSON.
---
### 2. **[high]** "Vacuum timeline" panel is not scoped by cluster/node/database
Two panels:
- `Dashboard_1_Node_performance_overview.json` panel id 42
- `Dashboard_7_Autovacuum_and_xmin_horizon.json` panel id 2
Query (excerpt):
```promql
group by (schema_name, table_name, vacuum_mode) (
pgwatch_pg_vacuum_progress_index_vacuum_count{phase="7", schema_name=~"$schema_name", table_name=~"$table_name"}
) * 0 + 7 or ...
```
Missing label selectors: `cluster="$cluster_name"`, `node_name="$node_name"`, `datname="$db_name"`. On any multi-cluster Grafana, vacuum activity from all monitored instances is mixed into one chart. Every other panel on these dashboards is correctly scoped.
**Fix:** add `cluster`, `node_name`, `datname` (or `tag_datname` → `datname`) selectors to all 7 PromQL legs and to the `group by` keys if cross-DB aggregation is unwanted.
---
### 3. **[high]** Monitoring user surfaces as the top xmin horizon blocker
`metrics.yml` → `xmin_horizon_blockers` filter for `pg_stat_activity`:
```sql
where
pid <> pg_backend_pid()
and backend_type = 'client backend'
and backend_xmin is not null
```
This excludes **only the current** monitoring connection, not other concurrent scrapes from the same role. On the live Supabase target, the **xmin horizon age by source** panel renders the top blockers as:
```
pg_stat_activity: queryid=-3261990840696325108 (postgres_ai_mon@postgres)
pg_stat_activity: queryid=-7514152763537218973 (postgres_ai_mon@postgres)
```
These are the metrics-collector's own queries. The age is ~0 tx so the chart line is flat, but `Current blocker counts` still reads 1–2 (instead of 0) almost permanently — making both panels low-signal.
**Fix options:**
- Exclude the monitoring role (e.g. `usename <> current_user`, parameterized so it works on any deploy where the scraper role is configurable) — preferred, narrow and explicit.
- Or apply a minimum-age threshold (e.g. `age(backend_xmin) > 10`) before treating a row as a "blocker."
Apply the same filter symmetrically in `xmin_horizon` (the `activity` CTE in the cluster-wide aggregator), otherwise `data_horizon_age_tx` will still include the monitoring user's xmin.
---
### 4. **[medium]** Top-N XID / MultiXID age panels surface pg_catalog tables
`metrics.yml` → `table_stats` excludes `pg_*` / `information_schema` / `_timescaledb*` only in the **partition-root** CTE (`q_root_part`). The main `q_tstats` CTE has no schema filter, so `pg_catalog.*` rows are emitted on the regular-table union arm.
Live evidence on the Supabase target: top-4 series in "Top-N tables by XID age" are
`pg_catalog.pg_ts_dict`, `pg_catalog.pg_trigger`, `pg_catalog.pg_authid`, `pg_catalog.pg_db_role_setting` (all at 14,734 tx, well below any threshold). On a quiet user workload the default Dashboard 7 view is dominated by system tables.
**Fix:** either filter `not n.nspname like any (array[E'pg\\_%', 'information_schema', E'\\_timescaledb%'])` in `q_tstats` too, OR change the Dashboard 7 `schema_name` variable default to exclude `pg_*`/`information_schema` (e.g. via a regex include).
---
### 5. **[medium]** Dashboard 7 `schema_name` / `table_name` template variables are not scoped
```
schema_name → label_values(pgwatch_table_stats_tx_freeze_age, schema)
table_name → label_values(pgwatch_table_stats_tx_freeze_age{schema=~"$schema_name"}, table_name)
```
Neither selector filters by `cluster`/`node_name`/`datname`. Switching cluster/node/db leaves stale dropdown options from other instances, and `$__all` selections leak across instances.
**Fix:** add `cluster="$cluster_name", node_name="$node_name", datname="$db_name"` to both label_values queries, consistent with how `db_name` itself is scoped on the dashboard.
---
### 6. **[medium]** `pg_wal_size_status_code` has no Grafana surface
`metrics.yml` → `pg_wal_size` returns a status code (0=ok, 1=function missing, 2=insufficient EXECUTE grant) and emits **no bytes** when status > 0. The Dashboard 1 "pg_wal directory size" panel queries only `pgwatch_pg_wal_size_bytes` — so operators on PG <10 or with a stripped-down monitoring role see an empty panel without explanation.
**Fix:** add a status stat panel (or "no data" override text) driven by `pgwatch_pg_wal_size_status_code` that explains "monitoring role lacks EXECUTE on `pg_ls_waldir()`" or "pg_ls_waldir() not available on this Postgres version".
---
### 7. **[low]** `pg_wal directory size` uses `last_over_time(...[1h])` smoothing
```promql
last_over_time(pgwatch_pg_wal_size_bytes{...}[1h])
```
A 1-hour lookback masks genuine scrape gaps. For a size gauge that should change slowly, this is mostly harmless, but means an outage in pgwatch is invisible on the panel for up to an hour.
**Fix:** raw gauge + `connectNulls=false` styling, or shorter window (`[10m]` aligns with most scrape intervals).
Same panel has `stacking.mode: normal` with a single series — no visual effect, safe to drop.
---
### 8. **[low]** `xmin_horizon.data_horizon_age_tx` lumps logical slot xmin under "data" horizon
```sql
greatest(
pg_stat_activity_age_tx,
pg_replication_slots_age_tx, -- includes logical slot xmin
pg_stat_replication_age_tx,
pg_prepared_xacts_age_tx
) as data_horizon_age_tx
```
For **physical** slots, `xmin` is data-horizon-relevant; for **logical** slots, the data horizon is largely held by `catalog_xmin`, and `xmin` may be null or weakly related. Per the [xmin horizon howto](https://postgres.ai/docs/postgres-howtos/performance-optimization/monitoring/how-to-monitor-xmin-horizon) this distinction matters for RCA.
**Fix (one of):**
- Restrict the `slots` CTE to `slot_type = 'physical'` for `data_horizon_age_tx` aggregation; or
- Add a small note to the metric `description` that data_horizon aggregates all slot xmins as a worst-case bound.
---
### 9. **[low]** `pg_autovacuum_blocked` cardinality grows monotonically
Per-scrape rows are bounded by `autovacuum_max_workers` (default 3), but the Prometheus series count accumulates over time across distinct `worker_pid` / `blocker_pid` / `blocker_queryid` triples. The metric description acknowledges this; no recording rule or relabeling mitigates it.
**Fix:** consider a recording rule that pre-aggregates `wait_seconds` by `tag_blocker_queryid` only, then drop the per-pid series on metric retention, OR document a Prometheus retention recommendation in the metric description.
---
### 10. **[low]** `editable: false` on all provisioned datasources
`3cb237d` correctly forces datasource re-provisioning by deleting stale state + setting `editable: false`. Side effect: operators cannot make ad-hoc test queries / temporarily flip query timeout etc. via the Grafana UI on already-provisioned instances.
**Fix:** confirm the documented escape hatch (custom values overlay / env var) is mentioned in `CONTRIBUTING.md` or `README.md` upgrade notes.
---
### 11. **[low]** Wraparound unit suffixes — `tx` vs `Tx`/`MXID`
Dashboard 1 panels 34/35 and Dashboard 7 panels 54/55 use lowercase `tx` / `mxid` as the unit suffix. Confirm casing is intentional; otherwise normalize to a single convention.
---
### 12. **[info]** Verified, no regression
- **Helm chart parity** — `diff -q config/grafana/dashboards postgres_ai_helm/config/grafana/dashboards` shows only `README.md` (intentionally non-helm). All 16 dashboards are byte-identical between the two locations, including Dashboard 7's new layout. No divergence regression introduced.
- **Self-monitoring rename / resource limits** (`4503426`) — live dashboard list still shows "Self-monitoring dashboard" with correct tags.
- **target-db / sink-postgres bumped to Postgres 17** (`0f6dbe1`) — `pg_wal_size`, `xmin_horizon*`, and `pg_stat_io` metrics all gate at PG 11+ / 16+; no SQL incompatibility surfaced in metric definitions.
---
## Recommended ordering
1. Fix #1 (duplicate panel id) — release blocker, single-line JSON edit + CI guard.
2. Fix #2 (Vacuum timeline label scoping) — one-line PromQL edits, observable bug on any multi-cluster install.
3. Fix #3 (self-monitoring blocker noise) — required for the new xmin horizon panels to be useful as RCA tooling.
4. Fix #4–#5 (pg_catalog / variable scoping) — quality-of-life on first-time Dashboard 7 load.
5. Fix #6–#11 — small follow-ups.
## Acceptance for closing this issue
- [ ] No duplicate panel ids in any `config/grafana/dashboards/*.json` (jq one-liner in CI).
- [ ] Dashboard 1 panel "pg_wal directory size" reachable via `?viewPanel=<new id>`.
- [ ] "Vacuum timeline" panels filter by `cluster`/`node_name`/`datname` (both Dashboards 1 and 7).
- [ ] `xmin horizon age by source` on Dashboard 7 does **not** list the monitoring user as a blocker on an idle target.
- [ ] Top-N XID/MultiXID age panels do not surface `pg_catalog.*` tables on a target with no user-table wraparound risk.
- [ ] `pgwatch_pg_wal_size_status_code` is surfaced somewhere on Dashboard 1 (panel or annotation) when status > 0.
issue