F002 Transaction wraparound check

Basics

See https://gitlab.com/postgres.ai/postgres-health-check/issues/164.

Transaction wraparound is a specific problem and it's and additional goal of autovacuum.

Documentation: https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND

See also "5.1. Transaction ID" in http://www.interdb.jp/pg/pgsql05.html.

Amazon RDS documentation on this topic: https://aws.amazon.com/ru/premiumsupport/knowledge-center/transaction-id-wraparound-effects/

The goal

We need to see how risky is the current state, what are risks to hit transaction wraparound problem soon – for that, we check relfrozenxid for each table (considering also related TOAST tables), and see, where are we not for each database and "the worst" tables in the current database. If age > 1200000000 (capacity is ~2000000000), then we will need to check autovacuum settings or manually run vacuum freeze.

Queries

Per instance, all databases (we assume that we don't have a lot of them):

select
  datname,
  age(datfrozenxid),
  round(
    age(datfrozenxid)::numeric * 100
      / (2 * 10^9 - current_setting('vacuum_freeze_min_age')::numeric
    )::numeric,
    2
  ) as capacity_used,
  datfrozenxid
from pg_database
order by 2 desc;

Per database, top-50 list:

select
  coalesce(nullif(n.nspname || '.', 'public.'), '') || c.relname,
  greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age,
  round(
    greatest(age(c.relfrozenxid), age(t.relfrozenxid))::numeric * 100
      / (2 * 10^9 - current_setting('vacuum_freeze_min_age')::numeric
    )::numeric,
    2
  ) as capacity_used,
  c.relfrozenxid as rel_relfrozenxid,
  t.relfrozenxid as toast_relfrozenxid
from pg_class c
join pg_namespace n on c.relnamespace = n.oid
left join pg_class t ON c.reltoastrelid = t.oid
where c.relkind IN ('r', 'm')
order by 3 desc
limit 50;

Acceptance criteria

As a DBA, when performing checkup for a cluster, I need to ensure, that there are no risks of hitting transaction wraparound in short- or mid-term. To ensure, I need to see:

  • the list of databases and "used" capacity (how much % of the ceiling is used), as well as raw "age", to be able to understand, how far is it from 2,000,000,000,
  • the top-50 (by "age") list of tables in the database I'm checking, with the same metrics: % of capacity used, age. If I see high values in the "capacity used" column (>50%), I understand that actions will be needed (vacuum freeze, additional autovacuum tuning).
Edited by Anna