v0.15.0-rc.4 pre-release review: monitoring stack (Dashboard 1, Dashboard 7, metrics.yml) (#196) · Issues · PostgresAI / postgresai

v0.15.0-rc.4 pre-release review: monitoring stack (Dashboard 1, Dashboard 7, metrics.yml)

## Scope Pre-release review of the **monitoring stack** in `v0.15.0-rc.4` against `v0.14.0-rc.0`. Live Grafana inspected at the internal monitoring instance (Supabase target). Source artifacts reviewed: - `config/grafana/dashboards/Dashboard_*.json` (especially 01 and 07) - `config/pgwatch-prometheus/metrics.yml` - `config/grafana/provisioning/datasources/datasources.yml` - `postgres_ai_helm/config/grafana/dashboards/` (parity check) Key feature deltas in this window: Dashboard 7 redesign (`40ca619`, `9dd249e`), xmin horizon monitoring (`3404937`), pg_wal directory size (`6ae6f06`), pg_stat_io / I001 (`4bed739`), datasource re-provisioning fix (`3cb237d`), 0.15 component bumps incl. target-db PG 15 → 17 (`0f6dbe1`). Findings below are enumerated. Severity tags are suggestions, not gates. --- ## Findings ### 1. **[blocker]** Duplicate panel `id: 50` on Dashboard 1 `config/grafana/dashboards/Dashboard_1_Node_performance_overview.json` contains two panels with `"id": 50`: - `"xmin horizon age (experimental)"` (added in `3404937` — xmin horizon) - `"pg_wal directory size"` (added in `6ae6f06` — pg_wal monitoring) Both features were developed in parallel and each picked `id: 50`. Confirmed live: opening `/d/.../01-...?viewPanel=50` resolves to the **xmin horizon** panel — the pg_wal panel is unreachable via `viewPanel`/share URLs and breaks any future "Library panel"/permalink usage. **Fix:** renumber `pg_wal directory size` to the next free id (e.g. 51) in both `config/grafana/dashboards/Dashboard_1_Node_performance_overview.json` and `postgres_ai_helm/config/grafana/dashboards/Dashboard_1_Node_performance_overview.json`. Add a CI/check that fails on duplicate panel ids across each dashboard JSON. --- ### 2. **[high]** "Vacuum timeline" panel is not scoped by cluster/node/database Two panels: - `Dashboard_1_Node_performance_overview.json` panel id 42 - `Dashboard_7_Autovacuum_and_xmin_horizon.json` panel id 2 Query (excerpt): ```promql group by (schema_name, table_name, vacuum_mode) ( pgwatch_pg_vacuum_progress_index_vacuum_count{phase="7", schema_name=~"$schema_name", table_name=~"$table_name"} ) * 0 + 7 or ... ``` Missing label selectors: `cluster="$cluster_name"`, `node_name="$node_name"`, `datname="$db_name"`. On any multi-cluster Grafana, vacuum activity from all monitored instances is mixed into one chart. Every other panel on these dashboards is correctly scoped. **Fix:** add `cluster`, `node_name`, `datname` (or `tag_datname` → `datname`) selectors to all 7 PromQL legs and to the `group by` keys if cross-DB aggregation is unwanted. --- ### 3. **[high]** Monitoring user surfaces as the top xmin horizon blocker `metrics.yml` → `xmin_horizon_blockers` filter for `pg_stat_activity`: ```sql where pid <> pg_backend_pid() and backend_type = 'client backend' and backend_xmin is not null ``` This excludes **only the current** monitoring connection, not other concurrent scrapes from the same role. On the live Supabase target, the **xmin horizon age by source** panel renders the top blockers as: ``` pg_stat_activity: queryid=-3261990840696325108 (postgres_ai_mon@postgres) pg_stat_activity: queryid=-7514152763537218973 (postgres_ai_mon@postgres) ``` These are the metrics-collector's own queries. The age is ~0 tx so the chart line is flat, but `Current blocker counts` still reads 1–2 (instead of 0) almost permanently — making both panels low-signal. **Fix options:** - Exclude the monitoring role (e.g. `usename <> current_user`, parameterized so it works on any deploy where the scraper role is configurable) — preferred, narrow and explicit. - Or apply a minimum-age threshold (e.g. `age(backend_xmin) > 10`) before treating a row as a "blocker." Apply the same filter symmetrically in `xmin_horizon` (the `activity` CTE in the cluster-wide aggregator), otherwise `data_horizon_age_tx` will still include the monitoring user's xmin. --- ### 4. **[medium]** Top-N XID / MultiXID age panels surface pg_catalog tables `metrics.yml` → `table_stats` excludes `pg_*` / `information_schema` / `_timescaledb*` only in the **partition-root** CTE (`q_root_part`). The main `q_tstats` CTE has no schema filter, so `pg_catalog.*` rows are emitted on the regular-table union arm. Live evidence on the Supabase target: top-4 series in "Top-N tables by XID age" are `pg_catalog.pg_ts_dict`, `pg_catalog.pg_trigger`, `pg_catalog.pg_authid`, `pg_catalog.pg_db_role_setting` (all at 14,734 tx, well below any threshold). On a quiet user workload the default Dashboard 7 view is dominated by system tables. **Fix:** either filter `not n.nspname like any (array[E'pg\\_%', 'information_schema', E'\\_timescaledb%'])` in `q_tstats` too, OR change the Dashboard 7 `schema_name` variable default to exclude `pg_*`/`information_schema` (e.g. via a regex include). --- ### 5. **[medium]** Dashboard 7 `schema_name` / `table_name` template variables are not scoped ``` schema_name → label_values(pgwatch_table_stats_tx_freeze_age, schema) table_name → label_values(pgwatch_table_stats_tx_freeze_age{schema=~"$schema_name"}, table_name) ``` Neither selector filters by `cluster`/`node_name`/`datname`. Switching cluster/node/db leaves stale dropdown options from other instances, and `$__all` selections leak across instances. **Fix:** add `cluster="$cluster_name", node_name="$node_name", datname="$db_name"` to both label_values queries, consistent with how `db_name` itself is scoped on the dashboard. --- ### 6. **[medium]** `pg_wal_size_status_code` has no Grafana surface `metrics.yml` → `pg_wal_size` returns a status code (0=ok, 1=function missing, 2=insufficient EXECUTE grant) and emits **no bytes** when status > 0. The Dashboard 1 "pg_wal directory size" panel queries only `pgwatch_pg_wal_size_bytes` — so operators on PG <10 or with a stripped-down monitoring role see an empty panel without explanation. **Fix:** add a status stat panel (or "no data" override text) driven by `pgwatch_pg_wal_size_status_code` that explains "monitoring role lacks EXECUTE on `pg_ls_waldir()`" or "pg_ls_waldir() not available on this Postgres version". --- ### 7. **[low]** `pg_wal directory size` uses `last_over_time(...[1h])` smoothing ```promql last_over_time(pgwatch_pg_wal_size_bytes{...}[1h]) ``` A 1-hour lookback masks genuine scrape gaps. For a size gauge that should change slowly, this is mostly harmless, but means an outage in pgwatch is invisible on the panel for up to an hour. **Fix:** raw gauge + `connectNulls=false` styling, or shorter window (`[10m]` aligns with most scrape intervals). Same panel has `stacking.mode: normal` with a single series — no visual effect, safe to drop. --- ### 8. **[low]** `xmin_horizon.data_horizon_age_tx` lumps logical slot xmin under "data" horizon ```sql greatest( pg_stat_activity_age_tx, pg_replication_slots_age_tx, -- includes logical slot xmin pg_stat_replication_age_tx, pg_prepared_xacts_age_tx ) as data_horizon_age_tx ``` For **physical** slots, `xmin` is data-horizon-relevant; for **logical** slots, the data horizon is largely held by `catalog_xmin`, and `xmin` may be null or weakly related. Per the [xmin horizon howto](https://postgres.ai/docs/postgres-howtos/performance-optimization/monitoring/how-to-monitor-xmin-horizon) this distinction matters for RCA. **Fix (one of):** - Restrict the `slots` CTE to `slot_type = 'physical'` for `data_horizon_age_tx` aggregation; or - Add a small note to the metric `description` that data_horizon aggregates all slot xmins as a worst-case bound. --- ### 9. **[low]** `pg_autovacuum_blocked` cardinality grows monotonically Per-scrape rows are bounded by `autovacuum_max_workers` (default 3), but the Prometheus series count accumulates over time across distinct `worker_pid` / `blocker_pid` / `blocker_queryid` triples. The metric description acknowledges this; no recording rule or relabeling mitigates it. **Fix:** consider a recording rule that pre-aggregates `wait_seconds` by `tag_blocker_queryid` only, then drop the per-pid series on metric retention, OR document a Prometheus retention recommendation in the metric description. --- ### 10. **[low]** `editable: false` on all provisioned datasources `3cb237d` correctly forces datasource re-provisioning by deleting stale state + setting `editable: false`. Side effect: operators cannot make ad-hoc test queries / temporarily flip query timeout etc. via the Grafana UI on already-provisioned instances. **Fix:** confirm the documented escape hatch (custom values overlay / env var) is mentioned in `CONTRIBUTING.md` or `README.md` upgrade notes. --- ### 11. **[low]** Wraparound unit suffixes — `tx` vs `Tx`/`MXID` Dashboard 1 panels 34/35 and Dashboard 7 panels 54/55 use lowercase `tx` / `mxid` as the unit suffix. Confirm casing is intentional; otherwise normalize to a single convention. --- ### 12. **[info]** Verified, no regression - **Helm chart parity** — `diff -q config/grafana/dashboards postgres_ai_helm/config/grafana/dashboards` shows only `README.md` (intentionally non-helm). All 16 dashboards are byte-identical between the two locations, including Dashboard 7's new layout. No divergence regression introduced. - **Self-monitoring rename / resource limits** (`4503426`) — live dashboard list still shows "Self-monitoring dashboard" with correct tags. - **target-db / sink-postgres bumped to Postgres 17** (`0f6dbe1`) — `pg_wal_size`, `xmin_horizon*`, and `pg_stat_io` metrics all gate at PG 11+ / 16+; no SQL incompatibility surfaced in metric definitions. --- ## Recommended ordering 1. Fix #1 (duplicate panel id) — release blocker, single-line JSON edit + CI guard. 2. Fix #2 (Vacuum timeline label scoping) — one-line PromQL edits, observable bug on any multi-cluster install. 3. Fix #3 (self-monitoring blocker noise) — required for the new xmin horizon panels to be useful as RCA tooling. 4. Fix #4–#5 (pg_catalog / variable scoping) — quality-of-life on first-time Dashboard 7 load. 5. Fix #6–#11 — small follow-ups. ## Acceptance for closing this issue - [ ] No duplicate panel ids in any `config/grafana/dashboards/*.json` (jq one-liner in CI). - [ ] Dashboard 1 panel "pg_wal directory size" reachable via `?viewPanel=<new id>`. - [ ] "Vacuum timeline" panels filter by `cluster`/`node_name`/`datname` (both Dashboards 1 and 7). - [ ] `xmin horizon age by source` on Dashboard 7 does **not** list the monitoring user as a blocker on an idle target. - [ ] Top-N XID/MultiXID age panels do not surface `pg_catalog.*` tables on a target with no user-table wraparound risk. - [ ] `pgwatch_pg_wal_size_status_code` is surfaced somewhere on Dashboard 1 (panel or annotation) when status > 0.

issue