F002 Transaction wraparound check
Basics
See https://gitlab.com/postgres.ai/postgres-health-check/issues/164.
Transaction wraparound is a specific problem and it's and additional goal of autovacuum.
Documentation: https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
See also "5.1. Transaction ID" in http://www.interdb.jp/pg/pgsql05.html.
Amazon RDS documentation on this topic: https://aws.amazon.com/ru/premiumsupport/knowledge-center/transaction-id-wraparound-effects/
The goal
We need to see how risky is the current state, what are risks to hit transaction wraparound problem soon – for that, we check relfrozenxid for each table (considering also related TOAST tables), and see, where are we not for each database and "the worst" tables in the current database. If age > 1200000000 (capacity is ~2000000000), then we will need to check autovacuum settings or manually run vacuum freeze.
Queries
Per instance, all databases (we assume that we don't have a lot of them):
select
datname,
age(datfrozenxid),
round(
age(datfrozenxid)::numeric * 100
/ (2 * 10^9 - current_setting('vacuum_freeze_min_age')::numeric
)::numeric,
2
) as capacity_used,
datfrozenxid
from pg_database
order by 2 desc;
Per database, top-50 list:
select
coalesce(nullif(n.nspname || '.', 'public.'), '') || c.relname,
greatest(age(c.relfrozenxid), age(t.relfrozenxid)) as age,
round(
greatest(age(c.relfrozenxid), age(t.relfrozenxid))::numeric * 100
/ (2 * 10^9 - current_setting('vacuum_freeze_min_age')::numeric
)::numeric,
2
) as capacity_used,
c.relfrozenxid as rel_relfrozenxid,
t.relfrozenxid as toast_relfrozenxid
from pg_class c
join pg_namespace n on c.relnamespace = n.oid
left join pg_class t ON c.reltoastrelid = t.oid
where c.relkind IN ('r', 'm')
order by 3 desc
limit 50;
Acceptance criteria
As a DBA, when performing checkup for a cluster, I need to ensure, that there are no risks of hitting transaction wraparound in short- or mid-term. To ensure, I need to see:
- the list of databases and "used" capacity (how much % of the ceiling is used), as well as raw "age", to be able to understand, how far is it from 2,000,000,000,
- the top-50 (by "age") list of tables in the database I'm checking, with the same metrics: % of capacity used, age. If I see high values in the "capacity used" column (>50%), I understand that actions will be needed (
vacuum freeze, additional autovacuum tuning).