Develop recommendation/documentation/solution to help avoid critical database issue due to huge xlog size
We are seeing an increase in problems with the PostgreSQL xlog filling up the disk for customers using Geo. The cause is one or more secondaries falling behind in replication (for example, because they went offline). The primary holds on to old transaction logs, via its replication slots, so that a secondary can resume replication when it comes back online. However, this often results in the root partition running out of disk space and bringing the entire server to a halt.
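For context, a quick way to see how much xlog each slot is holding back looks roughly like this. It's a sketch only: it assumes an Omnibus install (the `gitlab-psql` wrapper and the `gitlabhq_production` database) and PostgreSQL 9.4-9.6 function names; on PostgreSQL 10+ the equivalents are `pg_current_wal_lsn()` and `pg_wal_lsn_diff()`.

```shell
# Sketch: list replication slots and the amount of xlog each one is retaining.
# Assumes Omnibus paths and PostgreSQL 9.4-9.6 function names.
sudo gitlab-psql -d gitlabhq_production -c "
  SELECT slot_name,
         active,
         pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)::bigint) AS retained_xlog
  FROM pg_replication_slots
  ORDER BY pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) DESC;"
```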
In the best case, this requires us to find some free disk space in order to get PG back online, then remove the stale replication slots to free up the old xlogs. Here we've recommended customers use tune2fs to temporarily reclaim some of the disk's reserved space.
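For reference, the recovery steps we've been recommending look roughly like the following. The device and slot names are placeholders, and the tune2fs step assumes an ext2/3/4 root filesystem with the default 5% reserved-blocks setting.

```shell
# Temporarily lower the reserved-blocks percentage so PostgreSQL has room to start.
# /dev/sdXN is a placeholder for the device backing the full partition.
sudo tune2fs -m 1 /dev/sdXN

# With PostgreSQL back up, drop the stale slot so the old xlog can be recycled.
# 'geo_secondary_example' is a placeholder slot name taken from pg_replication_slots.
sudo gitlab-psql -d gitlabhq_production -c "SELECT pg_drop_replication_slot('geo_secondary_example');"

# Put the reserved-blocks percentage back to its default afterwards.
sudo tune2fs -m 5 /dev/sdXN
```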
In the worst cases, this causes corruption in the primary database due to the unsafe way in which PG is stopped. Recovery is then very difficult and data loss may occur. We can try something invasive like pg_resetxlog, or a variety of other methods, but if the corruption is too severe then a restore from the latest backup may be the only solution.
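For completeness, the invasive option looks roughly like this. This is a last-resort sketch only, it assumes Omnibus paths, and pg_resetxlog can discard committed transactions, so a filesystem-level copy of the data directory should be taken first.

```shell
# LAST RESORT: discard damaged xlog so the server can start again.
# This can lose committed transactions; copy the data directory first.
# Paths and the gitlab-psql system user assume an Omnibus install.
sudo gitlab-ctl stop postgresql
sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_resetxlog /var/opt/gitlab/postgresql/data
sudo gitlab-ctl start postgresql
```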
- Can we prevent xlog from using all of the disk space via PG settings? If we have to leave some secondaries behind, it's better to do that and alert the user than to bring down the primary too. (A rough monitoring/alerting sketch follows after this list.)
- Should we update the documentation to recommend putting xlog on a separate partition, so that a full xlog cannot so easily kill the entire database? (A sketch of this also follows below.)
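On the alerting side, I'm not aware of a PG setting in the versions we ship that puts a hard cap on the xlog retained by a replication slot, so an external check may be the practical option for now. A minimal sketch, with an arbitrary threshold and the same 9.4-9.6 function-name assumptions as above:

```shell
# Sketch: warn when any replication slot is retaining more xlog than a threshold.
# The threshold and how this gets scheduled (cron, monitoring agent, etc.) are
# placeholders; function names assume PostgreSQL 9.4-9.6.
THRESHOLD_BYTES=$((50 * 1024 * 1024 * 1024))  # 50 GB, arbitrary example
RETAINED=$(sudo gitlab-psql -d gitlabhq_production -t -A -c "
  SELECT COALESCE(MAX(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn))::bigint, 0)
  FROM pg_replication_slots;")
if [ "$RETAINED" -gt "$THRESHOLD_BYTES" ]; then
  echo "WARNING: a replication slot is retaining ${RETAINED} bytes of xlog" >&2
fi
```

On the partitioning side, a rough example of what such a documentation recommendation could describe, again assuming Omnibus paths and a new filesystem already mounted at /mnt/pg_xlog (both placeholders):

```shell
# Sketch: move pg_xlog onto its own partition so a full xlog can't fill the root
# filesystem. Assumes Omnibus paths and an already-mounted /mnt/pg_xlog.
sudo gitlab-ctl stop postgresql
sudo mv /var/opt/gitlab/postgresql/data/pg_xlog /mnt/pg_xlog/pg_xlog
sudo ln -s /mnt/pg_xlog/pg_xlog /var/opt/gitlab/postgresql/data/pg_xlog
sudo chown -h gitlab-psql:gitlab-psql /var/opt/gitlab/postgresql/data/pg_xlog
sudo gitlab-ctl start postgresql
```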
FYI @ernstvn
@nick.thomas Would love your input on possible solutions.