Dashboard 7 redesign: Autovacuum and xmin horizon RCA

Closes #177 (closed)

Goal

Make Dashboard 7 a great triage tool for "is autovacuum keeping up, who is blocking xmin horizon, and how close are we to wraparound?" — and keep wraparound risk one click away on Dashboard 1.

References:

Dashboard 7 — Autovacuum and xmin horizon

Renamed from 07. Autovacuum, xmin horizon, bloat07. Autovacuum and xmin horizon. No bloat panels live here, and wraparound + xmin horizon pressure are autovacuum's failure modes — coherent narrative.

Top-down structure:

  1. Wraparound risk — top-N tables (drill down to specific tables) — XID and MultiXID timeseries, full-width stacked. Each chart adds three reference lines pulled live from pg_settings:
    • soft (autovacuum_freeze_max_age / autovacuum_multixact_freeze_max_age)
    • failsafe (vacuum_failsafe_age / vacuum_multixact_failsafe_age, PG14+)
    • hard (2^31 − 10M ≈ 2.146B, static — Postgres refuses new XIDs at this point)
  2. xmin horizon overview (experimental) — per-source age timeseries (pg_stat_activity, pg_replication_slots xmin/catalog_xmin, pg_stat_replication, pg_prepared_xacts) with the active blocker identity (queryid / slot_name / standby_name / prepared_gid) embedded in the legend.
  3. xmin horizon blockers (experimental) — longest non-idle transaction age + current blocker counts.
  4. Autovacuum mechanicsAutovacuum debt — top-N overdue tables (per-table dead-tuple debt vs autovacuum trigger threshold, sorted by autovacuum_overdue_factor, with a 1.0 reference line — anything above is overdue), worker pool active-vs-max saturation, "Autovacuum workers blocked on lock" table, and the existing vacuum timeline.

A top_n template variable controls Top-N panel limits dashboard-wide (5/10/15/20/25/50/100, default 20).

Dashboard 7

Dashboard 1 — Single node performance overview (high-level)

  • Renamed both database-level wraparound panels for clarity: Database age (datfrozenxid) — wraparound risk and Database multixid age (datminmxid) — wraparound risk. Each panel now carries a description covering the gotchas (age in transactions not seconds; per-database min(pg_class.relfrozenxid); autovacuum_freeze_max_age triggers anti-wraparound autovacuum; ~2^31 − 10M causes the cluster to refuse new XIDs). Same threshold reference lines as on Dashboard 7.
  • Placed DB logical size distribution and pg_wal directory size side-by-side on a single row.

Dashboard 1

New / changed metrics in metrics.yml

  • pg_settings_wraparound — snapshots autovacuum_freeze_max_age, autovacuum_multixact_freeze_max_age, vacuum_failsafe_age, vacuum_multixact_failsafe_age so dashboards can plot threshold lines that reflect actual cluster config (not compiled defaults).
  • table_stats — new mxid_freeze_age column from mxid_age(c.relminmxid) (previously age(c.relminmxid) was used, which on a fresh DB with relminmxid=1 and a small current xid wraps to 2^31 − 1 — fixed in this MR).
  • pg_autovacuum_workers — active autovacuum worker count vs autovacuum_max_workers and free slots.
  • pg_autovacuum_queue — per-table n_dead_tup, computed autovacuum_threshold (with pg_class.reloptions overrides applied), and the derived autovacuum_overdue_factor (n_dead_tup ÷ threshold).
  • pg_autovacuum_blocked — autovacuum workers waiting on a lock joined to the lock holder (pid, queryid, wait seconds). Filters on backend_type='autovacuum worker' before joining pg_locks to keep cost down. Cardinality bounded by autovacuum_max_workers.
  • pg_vacuum_progress — added is_anti_wraparound boolean derived from pg_stat_activity.backend_xid IS NOT NULL. More robust than the existing query-text regex (which depends on the literal autovacuum query text and breaks under non-English locales / future wording changes).

Cross-cutting

  • Uniform Dashboards dropdown link with keepTime: true and includeVars: true on every dashboard, plus a postgres-ai tag on each, so cross-dashboard navigation preserves time range and variable values.
  • Added missing helm symlink for Dashboard_14_IO_Statistics.json.

Verification

A rotating xmin-horizon blocker workload (long REPEATABLE READ tx → logical slot → prepared xact → physical slot → stacked) was used to verify each blocker class lights up the relevant panel. New pgwatch_table_stats_mxid_freeze_age, pgwatch_pg_settings_wraparound_*, pgwatch_pg_autovacuum_workers_*, pgwatch_pg_autovacuum_queue_*, and pgwatch_pg_autovacuum_blocked_* series all confirmed flowing into VictoriaMetrics. Dashboards 1 and 7 rendered end-to-end against the demo target. The mxid_age fix verified to drop a fresh-DB mxid_freeze_age from 2147483647 → 0.

Test plan

  • Smoke-check Dashboard 7 against any cluster — wraparound, xmin horizon, autovacuum mechanics rows all populate.
  • Confirm top_n dropdown works on the Top-N panels.
  • Click into a queryid from the autovacuum workers blocked-on-lock table — should drilldown to Dashboard 03 with var-queryid populated and time/vars preserved.
  • Use the Dashboards dropdown link from any dashboard — destination dashboard should keep from/to and template variables.
  • On a PG14+ cluster, the failsafe threshold line is visible on the wraparound charts; on PG13, it is filtered out (> 0 guard).
Edited by Nikolay Samokhvalov

Merge request reports

Loading