Quarantine snapshots for failed Postgres recoveries

Incident production#6429 (closed) delayed data ingestion into the data warehouse.

  • This event is rare (roughly twice a year): Postgres gets "stuck" during recovery and never finishes. This is why #3 (closed) implemented retry logic.
  • We have seen this problem before and tried to debug it. Our observation was that Postgres was stuck on select, waiting for a WAL file that we verified was present (and was not the last one).
  • Because the recovery must finish in time for the Data Team to pull data, we only investigated briefly.
  • Quarantining the affected snapshots and cloning from them might allow us to investigate the problem in detail without blocking the Data Team.
Edited by Gerardo Lopez-Fernandez