Dashboard 7 redesign: Autovacuum and xmin horizon RCA
Closes #177 (closed)
Goal
Make Dashboard 7 a great triage tool for "is autovacuum keeping up, who is blocking xmin horizon, and how close are we to wraparound?" — and keep wraparound risk one click away on Dashboard 1.
References:
Dashboard 7 — Autovacuum and xmin horizon
Renamed from 07. Autovacuum, xmin horizon, bloat → 07. Autovacuum and xmin horizon. No bloat panels live here, and wraparound + xmin horizon pressure are autovacuum's failure modes — coherent narrative.
Top-down structure:
- Wraparound risk — top-N tables (drill down to specific tables) — XID and MultiXID timeseries, full-width stacked. Each chart adds three reference lines pulled live from
pg_settings:- soft (
autovacuum_freeze_max_age/autovacuum_multixact_freeze_max_age) - failsafe (
vacuum_failsafe_age/vacuum_multixact_failsafe_age, PG14+) - hard (
2^31 − 10M≈ 2.146B, static — Postgres refuses new XIDs at this point)
- soft (
- xmin horizon overview (experimental) — per-source age timeseries (pg_stat_activity, pg_replication_slots xmin/catalog_xmin, pg_stat_replication, pg_prepared_xacts) with the active blocker identity (queryid / slot_name / standby_name / prepared_gid) embedded in the legend.
- xmin horizon blockers (experimental) — longest non-idle transaction age + current blocker counts.
- Autovacuum mechanics —
Autovacuum debt — top-N overdue tables(per-table dead-tuple debt vs autovacuum trigger threshold, sorted byautovacuum_overdue_factor, with a 1.0 reference line — anything above is overdue), worker pool active-vs-max saturation, "Autovacuum workers blocked on lock" table, and the existing vacuum timeline.
A top_n template variable controls Top-N panel limits dashboard-wide (5/10/15/20/25/50/100, default 20).
Dashboard 1 — Single node performance overview (high-level)
- Renamed both database-level wraparound panels for clarity:
Database age (datfrozenxid) — wraparound riskandDatabase multixid age (datminmxid) — wraparound risk. Each panel now carries a description covering the gotchas (age in transactions not seconds; per-databasemin(pg_class.relfrozenxid);autovacuum_freeze_max_agetriggers anti-wraparound autovacuum; ~2^31 − 10Mcauses the cluster to refuse new XIDs). Same threshold reference lines as on Dashboard 7. - Placed
DB logical size distributionandpg_wal directory sizeside-by-side on a single row.
New / changed metrics in metrics.yml
pg_settings_wraparound— snapshotsautovacuum_freeze_max_age,autovacuum_multixact_freeze_max_age,vacuum_failsafe_age,vacuum_multixact_failsafe_ageso dashboards can plot threshold lines that reflect actual cluster config (not compiled defaults).table_stats— newmxid_freeze_agecolumn frommxid_age(c.relminmxid)(previouslyage(c.relminmxid)was used, which on a fresh DB withrelminmxid=1and a small current xid wraps to2^31 − 1— fixed in this MR).pg_autovacuum_workers— active autovacuum worker count vsautovacuum_max_workersand free slots.pg_autovacuum_queue— per-tablen_dead_tup, computedautovacuum_threshold(withpg_class.reloptionsoverrides applied), and the derivedautovacuum_overdue_factor(n_dead_tup ÷ threshold).pg_autovacuum_blocked— autovacuum workers waiting on a lock joined to the lock holder (pid, queryid, wait seconds). Filters onbackend_type='autovacuum worker'before joiningpg_locksto keep cost down. Cardinality bounded byautovacuum_max_workers.pg_vacuum_progress— addedis_anti_wraparoundboolean derived frompg_stat_activity.backend_xid IS NOT NULL. More robust than the existing query-text regex (which depends on the literal autovacuum query text and breaks under non-English locales / future wording changes).
Cross-cutting
- Uniform
Dashboardsdropdown link withkeepTime: trueandincludeVars: trueon every dashboard, plus apostgres-aitag on each, so cross-dashboard navigation preserves time range and variable values. - Added missing helm symlink for
Dashboard_14_IO_Statistics.json.
Verification
A rotating xmin-horizon blocker workload (long REPEATABLE READ tx → logical slot → prepared xact → physical slot → stacked) was used to verify each blocker class lights up the relevant panel. New pgwatch_table_stats_mxid_freeze_age, pgwatch_pg_settings_wraparound_*, pgwatch_pg_autovacuum_workers_*, pgwatch_pg_autovacuum_queue_*, and pgwatch_pg_autovacuum_blocked_* series all confirmed flowing into VictoriaMetrics. Dashboards 1 and 7 rendered end-to-end against the demo target. The mxid_age fix verified to drop a fresh-DB mxid_freeze_age from 2147483647 → 0.
Test plan
- Smoke-check Dashboard 7 against any cluster — wraparound, xmin horizon, autovacuum mechanics rows all populate.
- Confirm
top_ndropdown works on the Top-N panels. - Click into a queryid from the autovacuum workers blocked-on-lock table — should drilldown to Dashboard 03 with
var-queryidpopulated and time/vars preserved. - Use the
Dashboardsdropdown link from any dashboard — destination dashboard should keepfrom/toand template variables. - On a PG14+ cluster, the failsafe threshold line is visible on the wraparound charts; on PG13, it is filtered out (
> 0guard).

