F003 Autovacuum analysis: dead tuples
Terminology
- row - logical concept of a row in a table
- tuple (a.k.a. "row version) – physical object on disk / in memory. There might be multiple tuples corresponding to a single row: if row is updated, we have an old version of it (old tuple), and a new, current version of it (new tuple) visible to our current and all new transactions.
What is VACUUM
VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.
https://www.postgresql.org/docs/current/sql-vacuum.html
Autovacuum
There is a designated process – autovacuum, which runs VACUUM automatically, periodically, based on a number of various factors. It also does additional work:
- recalculates statistics (the same as manual ANALYZE does), and
- protects from transaction wraparound https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND
The problem
The default autovacuum settings are very conservative – in a few words, autovacuum visits tables very rare, allowing too many dead tuples.
The goal
This report is the first step in autovacuum analysis and tuning process. This process usually involves answering to questions:
- do we have many dead tuples in big tables (20% can be considered as a big number)?
- are there any risks to hit transaction wraparound problem?
- do we have bloat problem? (see https://gitlab.com/postgres.ai/postgres-health-check/issues/87 and https://gitlab.com/postgres.ai/postgres-health-check/issues/88)
- and finally, does autovacuum consume a lot of resources (if it's tuned too aggressively, it typically leads to high reads IO; we also need to care about CPU consumption, of
autovacuum_workers
is too high and we have low number of CPU cores).
Dead Tuple Ratio, %
)
How we determine % of dead tuples (Reading data from pg_stat_all_tables
, we take n_live_tup
and n_dead_tup
, and use the following simple formula to get the ratio in %:
100 % n_dead_tup / (n_dead_tup + n_live_tup)
Two things must be noticed here:
-
n_live_tup
andn_dead_tup
might be wrong, lagging if ANALYZE was long ago (to address that, we also displayreltuples
frompg_class
- if it's too far fromn_live_tup
, the lagging problem is obvious) - pg_stat_*** data can be reset recently, we need to see how long ago. Resetting time is stored in
pg_stat_database.stats_reset
, this needs to be displayed above the report's table.
Acceptance criteria
As a DBA, when checkup is performed, I see the table with the top-50 list of tables containing the largest values of "Dead Tuple Ratio", and also:
- see how many times autovacuum or manual VACUUM was performed, how long ago it was done,
- see
n_live_tup
andpg_class.reltuples
next to each other allowing easy comparison of them (if they are not close to each other, I make conclusion that number are not trustworthy and ANALYZE is needed), - and see when and how long ago statistics collector was reset.