Set up tools for debugging xlog spikes that cause replication lag

We have existing post-mortem issues for debugging the cause of replication lag incidents. The basic process for doing this is:

  • Set up WAL-E with production decryption keys
  • Run wal-e to download segments for the relevant time period
  • Build and run xlogdump across the relevant files
  • Analyze them using hand-written grep, sed, awk, uniq -c combinations
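
For reference, one pass of that manual flow looks roughly like the sketch below. The segment names, storage prefix, and the awk column are only illustrative, and pg_xlogdump stands in for whichever xlogdump build we actually use; the output format (and therefore the field positions) varies between tools and PostgreSQL versions.

```
# Illustrative only: segment names, prefix and output columns are made up.
export WALE_GS_PREFIX=gs://example-wal-bucket/prod

# Fetch the segments covering the incident window.
wal-e wal-fetch 000000010000002A00000065 /tmp/xlog/000000010000002A00000065
wal-e wal-fetch 000000010000002A00000066 /tmp/xlog/000000010000002A00000066

# Dump them and count record types by hand.
pg_xlogdump /tmp/xlog/000000010000002A00000065 /tmp/xlog/000000010000002A00000066 \
  | awk '{print $2}' | sort | uniq -c | sort -rn | head -20
```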

Instead it seems like it would be just as easy to do this in a scalable, repeatable way -- and one that also doesn't put production data on random laptops or elsewhere:

  • Set up a GCP instance that uses the standard bootstrap.sh to obtain the production decryption keys via KMS etc.
  • Run wal-e to continuously fetch new logs as they appear -- or even set up pg_receivexlog to stream them
  • Run xlogdump on new files as they appear
  • Write a simple mtail script to keep interesting counters
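
A rough sketch of what the fetch-and-dump loop on that instance could look like. Paths, filenames and the segment-name bookkeeping are all hand-waved (the list of names could come from pg_receivexlog's output directory or a tiny helper that increments segment names); the point is just that each new segment gets dumped into a text log that mtail watches.

```
#!/bin/bash
# Hypothetical polling loop for the analysis instance.
set -euo pipefail

XLOG_DIR=/srv/xlog              # where fetched segments land
DUMP_LOG=/var/log/xlogdump.log  # text output that mtail tails

while read -r seg; do
  wal-e wal-fetch "$seg" "$XLOG_DIR/$seg"      # decrypts with the prod keys from bootstrap.sh
  pg_xlogdump "$XLOG_DIR/$seg" >> "$DUMP_LOG"  # mtail picks the new lines up from here
  rm "$XLOG_DIR/$seg"                          # keep local disk usage bounded
done < /srv/segments-to-fetch                  # produced elsewhere; see note above
```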

That would be roughly the same amount of work as the manual process above, but with the benefit of being documented and repeatable: just rerun the same bootstrap script and install the same wal-e and mtail scripts from the repo.

Next steps:

The next step would be to leave the instance running indefinitely (adding it to the chef scripts as a pet so it has the same security policies as any other machine) and then:

  • Point Prometheus at mtail's metrics endpoint
  • Build a simple dashboard in Grafana showing the rates of the counters

That would not be very hard, and it would let us quickly see, whenever we want to understand an xlog spike, where the extra xlog is coming from. I think it would be a valuable addition to our dashboards.
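
On the Prometheus/Grafana side there is not much to build: mtail already serves its counters over HTTP (port 3903 by default), so it becomes one more scrape target, and the dashboard panels are just rate() queries. The metric names below assume the hypothetical mtail program sketched at the end of this issue.

```
# Quick sanity check on the instance once mtail is running
# (3903 is mtail's default HTTP port; metric names come from the sketch below).
curl -s http://localhost:3903/metrics | grep '^xlog_'

# The Grafana panels would then graph per-label rates, e.g.
#   rate(xlog_bytes_total[5m])   -- WAL bytes/sec by record type and relid
#   rate(xlog_fpi_total[5m])     -- full-page images/sec by relid
```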

The typical causes are:

  • vacuuming a table full of pages that need to be vacuumed (typically after a large batch update or delete)
  • vacuum freeze on a large table that hasn't needed to be vacuumed recently (typically insert-only tables)
  • a large batch update or insert (typically a migration in our case)
  • heavy lock contention on foreign key locks
  • Index builds?
  • Full page write spikes after a checkpoint

So my thinking is that the mtail script should keep counters for the number of records and the number of bytes, indexed by two labels: record-type and relid. I don't think we need more detailed information such as the block number or ctid. That would be a few hundred counters, which should be no big deal for Prometheus. I think it would also be good to have counts of full page images per relid.
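
A first cut of that mtail program could look like the sketch below, written out the way a bootstrap step might install it. The regex assumes pg_xlogdump-style output and uses the resource manager name (rmgr) as a stand-in for record-type; capturing the desc keyword as well would give finer-grained record types. The file paths and the exact pattern are assumptions and will need adjusting to whichever xlogdump variant we end up running.

```
# Hypothetical install location and log path; adjust to match the instance setup.
cat > /etc/mtail/xlogdump.mtail <<'EOF'
counter xlog_records_total by rmgr, relid
counter xlog_bytes_total by rmgr, relid
counter xlog_fpi_total by relid

# e.g. "rmgr: Heap  len (rec/tot): 59/199, ... blkref #0: rel 1663/16384/16385 blk 0 FPW"
# Records without a block reference (commits, checkpoints, ...) are not counted by this first cut.
/rmgr: (?P<rmgr>\w+)\s+len \(rec\/tot\):\s+\d+\/\s*(?P<tot_len>\d+).* rel \d+\/\d+\/(?P<relid>\d+)/ {
  xlog_records_total[$rmgr][$relid]++
  xlog_bytes_total[$rmgr][$relid] += $tot_len
}

# Count full-page images separately, per relid.
/rel \d+\/\d+\/(?P<relid>\d+).*FPW/ {
  xlog_fpi_total[$relid]++
}
EOF

mtail --progs /etc/mtail --logs /var/log/xlogdump.log
```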