Right now we generate an LVM snapshot only once every 24 hours, and it is used to populate our staging databases. This has proven inadequate as a recovery technique. Instead we need to create these snapshots at least once an hour and keep them around for at least 24 hours, 48 if possible. That would allow us to quickly restore from a disk snapshot without having to resort to (potentially messy) Azure tools, PostgreSQL restoration tools, or something else.
@jnijhof: Since you're our LVM expert (as far as I know), I'm assigning this to you; feel free to re-assign if necessary.
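For reference on the "quickly restore" part: rolling a volume back to an hourly LVM snapshot is essentially a single merge operation, which is what makes this attractive compared with going through Azure- or PostgreSQL-level restores. A minimal sketch, assuming hypothetical volume, snapshot, mount point and service names:

```bash
# Hypothetical restore flow; all names (vg0, pgdata, pgdata_hourly_03,
# the mount point and the service name) are placeholders.

# Stop PostgreSQL and unmount so nothing is writing to the volume.
sudo service postgresql stop
sudo umount /var/lib/postgresql

# Merge the chosen snapshot back into its origin, i.e. roll the LV back.
# Depending on the LVM version the merge may only start on the next
# activation of the origin volume.
sudo lvconvert --merge vg0/pgdata_hourly_03

# Remount and bring PostgreSQL back up once the merge is under way.
sudo mount /dev/vg0/pgdata /var/lib/postgresql
sudo service postgresql start
```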
Do you have all of the database (data, tablespaces, pg_xlog) on the same LVM volume? If not, make sure you can snapshot all of the volumes atomically, at the same time.
The thing about LVM snapshots and PostgreSQL is that I don't believe you're getting consistent data, because the database has not been quiesced before the snapshot is taken.
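One way to address both points would be to put PostgreSQL into backup mode (forcing a checkpoint) around the snapshot, and to take the snapshots of every volume the cluster lives on inside that window. A rough sketch only, assuming a pre-10 cluster (the thread refers to pg_xlog) and placeholder volume names and sizes:

```bash
#!/usr/bin/env bash
# Sketch: quiesce PostgreSQL around the LVM snapshot(s).
# Volume names (vg0/pgdata, vg0/pgxlog) and sizes are placeholders.
set -euo pipefail

# Tell PostgreSQL a base backup is starting; this forces a checkpoint and
# keeps the data files in a state that is safe to copy/snapshot.
sudo -u postgres psql -c "SELECT pg_start_backup('lvm_snapshot', true);"

# Snapshot every volume the cluster uses. Note: separate lvcreate calls are
# NOT atomic with respect to each other, which is why the backup mode above
# (plus the WAL) is what actually makes a consistent restore possible.
sudo lvcreate --snapshot --size 10G --name pgdata_snap vg0/pgdata
sudo lvcreate --snapshot --size 5G  --name pgxlog_snap vg0/pgxlog

# End backup mode.
sudo -u postgres psql -c "SELECT pg_stop_backup();"
```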
@helsus not really, no. We use a copy of production in staging so that we can get timings for things like database migrations and other operations at scale, with a full data load. We treat the staging environment exactly the same way we treat production with regard to security and limited access, logging and auditing, and controls. It would be a risk to hold staging to a different set of standards than production, given that the data is the same.
LVM snapshots have already proven to work for us, since we use them in staging, where we snapshot our replication DB (on staging) onto the current staging DB (on staging). But I think some people are mixing a few things up.
Staging was never meant to be restored from; it is just a copy of production that is refreshed every 24 hours.
We are not doing LVM snapshots, nor Azure snapshots, for our DBs in production.
But since LVM snapshots are working well for us in staging, we might consider implementing them for production as well.
There is a tiny lvm_snapshot script in place at /usr/local/bin/lvm_snapshot which runs every hour with a retention of 3. I still need to wrap things up in docs/Chef etc., though.
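For reference, a minimal sketch of what an hourly snapshot-with-retention script could look like; the VG/LV names, sizes and retention count below are assumptions, not the actual contents of /usr/local/bin/lvm_snapshot:

```bash
#!/usr/bin/env bash
# Sketch of an hourly snapshot script with a fixed retention.
# VG/LV names, the snapshot size and RETAIN are placeholders.
set -euo pipefail

VG="vg0"
LV="pgdata"
RETAIN=24   # per the discussion below, 24 (or 48) rather than 3

# Create a new timestamped snapshot of the data volume.
lvcreate --snapshot --size 10G \
  --name "${LV}_snap_$(date +%Y%m%d%H)" "${VG}/${LV}"

# Drop the oldest snapshots beyond the retention count.
lvs --noheadings -o lv_name --select "lv_name =~ ^${LV}_snap_" "$VG" \
  | awk '{print $1}' | sort | head -n -"$RETAIN" \
  | while read -r snap; do
      lvremove -f "${VG}/${snap}"
    done
```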
@jnijhof Keeping the snapshots for only 3 hours doesn't give admins a lot of time to catch an issue. I would recommend keeping at least 24 previous snapshots.
Consider that the solution proposed here (periodic LVM snapshots) is not the only way to achieve the business objective (zero customer-facing downtime, even in the face of node failures caused by hardware, software, or human error), and that the snapshots still do not capture the subsequent PG transaction log file events.
Another enhancement/solution to consider would be to put the PG transaction log on its own HA/DR volume, mirrored to another DC region at the block level (a DRBD device, or an appropriate DHT Merkle-tree filesystem).
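Related to the point about snapshots not capturing subsequent transaction-log events: continuous WAL archiving to a separate (replicated) volume is a purely PostgreSQL-level way to get a similar effect, and could complement either the LVM or the DRBD approach. A sketch, assuming pre-10-style settings and a hypothetical mount point /mnt/wal_archive for the mirrored volume:

```bash
# Sketch: ship every completed WAL segment to a separate volume, which could
# itself be the DRBD-mirrored device suggested above. Paths are placeholders.

# postgresql.conf (pre-10 style settings):
#   wal_level       = hot_standby
#   archive_mode    = on
#   archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

# Reload PostgreSQL so the archive_command takes effect
# (changing archive_mode itself requires a full restart).
sudo -u postgres psql -c "SELECT pg_reload_conf();"
```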