High and spiky dead tuple percentage on project_mirror_data

We've been seeing dead tuple stats for project_mirror_data in the range of 10-20% of the table, which is already surprisingly high. Recently, though, we've seen spikes as high as 50%, which is alarming.

This table has been getting vacuumed once per minute which, given our autovacuum_naptime of 1min, basically means autovacuum is vacuuming it whenever it gets a chance. That's alarming because it means that if the rate of dead tuple creation increases, autovacuum can't increase the rate of vacuuming to match, and the peak number of dead tuples will grow out of control.
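To make that concern concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes dead tuples are created at a constant rate and that each vacuum removes everything except a fixed leftover. The function name, the `leftover` parameter, and the rates in the loop are illustrative assumptions only.

```python
def peak_dead_tuples(dead_per_second, vacuum_interval_s, leftover=0):
    """Dead tuples present just before a vacuum runs, under a constant-rate model.

    leftover is the (assumed fixed) number of dead tuples each vacuum cannot remove.
    """
    return leftover + dead_per_second * vacuum_interval_s


# Hypothetical rates of dead tuple creation (tuples/second), with the vacuum
# interval pinned at a 60 second naptime.
for rate in (50, 100, 200):
    print(rate, peak_dead_tuples(rate, vacuum_interval_s=60))
```

In this model, once the interval is pinned at the naptime, the peak scales directly with the rate of dead tuple creation, and vacuum has no room left to compensate.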

For context, this table has only 34,354 tuples, so it's easy for the number of dead tuples to rise to a significant percentage of the table in a short time.

@_stark tried decreasing the autovacuum_naptime to allow more frequent vacuums. The table was indeed vacuumed more frequently, yet the total number of dead tuples rose precipitously. That makes no sense if things were working as intended.

@yorickpeterse adjusted the cost delay parameters to allow vacuums to complete more quickly, and that appeared to help, though the data is erratic enough that it's difficult to tell.

Further analysis: the vacuums on this table are very fast. They take less than a second to run, usually about 100-200ms, so speeding up the vacuums will have no effect.

Generally, vacuums clean up a fairly steady 2-4k tuples and leave a fairly constant 4-6k dead tuples behind, so there's no problem with the rate of vacuuming. However, there were a number of occasions throughout the day when vacuum cleaned up 0 tuples, even for multiple vacuums in a row. In one period there were 73 vacuums in a row which removed 0 tuples. During that period the number of nonremovable dead tuples rose steadily to 117,902. That means 77% of the table was dead tuples, not because vacuum wasn't running often enough but because the tuples were still visible to a long-running transaction somewhere or to a lagging replica.
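The 77% figure can be reproduced with a quick calculation. This is a sketch that assumes the 34,354 figure quoted earlier is the live tuple count and that the percentage is dead / (dead + live). The function name is illustrative.

```python
def dead_fraction(dead, live):
    """Fraction of all tuples (live + dead) that are dead."""
    return dead / (dead + live)


live = 34_354    # assumption: the table's tuple count quoted above is the live count
dead = 117_902   # peak nonremovable dead tuples during the 73-vacuum stretch

print(f"{dead_fraction(dead, live):.1%}")  # 77.4%
```

Under the same assumption, the usual 4-6k leftover dead tuples work out to roughly 10-15% of the table, which is in line with the baseline we've been seeing.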

The estimated total number of tuples in the table, as tracked in the pg statistics, seems to fluctuate dramatically, resulting in unreliable estimates of the percentage of dead tuples in the table. The estimate seems to increase consistently on every vacuum except the ones following an analyze, where it drops again. I'm not sure how the "tuples remaining" figure actually affects the count of dead tuples, but if it resulted in the estimate of dead tuples rising, it might explain why decreasing the naptime increased the estimated dead tuples. That would be a measurement error, though, which analyze subsequently corrected. It might also indicate that decreasing the rate of analyzes would affect the accuracy of this measurement and might have surprising effects on the measured dead tuples.

The actual total size of the table never changed from 5,005 pages. So any problems with dead tuples are not causing any significant table bloat (or, if they have caused bloat, they are not adding to whatever bloat was already there).
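For a sense of scale, the page count converts to bytes as follows. This sketch assumes the default PostgreSQL block size of 8192 bytes, which has not been confirmed for this cluster.

```python
BLOCK_SIZE = 8192   # assumption: default PostgreSQL block size in bytes
pages = 5_005       # table size reported above

size_bytes = pages * BLOCK_SIZE
size_mib = size_bytes / (1024 * 1024)

print(size_bytes)           # 41000960
print(f"{size_mib:.1f} MiB")  # 39.1 MiB
```

At about 39 MiB the table is small in absolute terms, which fits with the sub-second vacuum times noted earlier.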