Quarantine snapshots for failed Postgres recoveries
Incident production#6429 (closed) delayed data ingestion into the data warehouse.
- This happens rarely (maybe twice a year): Postgres gets "stuck" during recovery and never finishes. This is why #3 (closed) implemented retry logic.
- We have seen this problem before and tried to debug it, but all we observed was that Postgres was stuck on `select`, waiting for a WAL file that we verified was present (and wasn't the last one).
- Because the recovery is time-constrained so that the Data Team can pull data, we only investigated briefly.
- Quarantining failed snapshots and cloning from them might allow us to investigate the problem in detail without blocking the Data Team.
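
As a rough illustration of the idea, here is a minimal Python sketch of how quarantining could sit alongside the existing retry logic. Everything in it is an assumption made for illustration: the function `recover_with_quarantine`, the `try_recover` and `quarantine` callables, and the snapshot names are hypothetical and do not come from the actual recovery tooling. It also assumes that each retry uses a different snapshot and that `try_recover` enforces its own timeout, returning `False` when Postgres gets stuck.

```python
from typing import Callable, Iterable, List, Optional


def recover_with_quarantine(
    snapshots: Iterable[str],
    try_recover: Callable[[str], bool],
    quarantine: Callable[[str], None],
) -> Optional[str]:
    """Return the first snapshot that recovers; quarantine the ones that do not.

    All names here are hypothetical stand-ins, not the real tooling.
    """
    for snapshot in snapshots:
        if try_recover(snapshot):
            return snapshot
        # Keep the failed snapshot instead of discarding it, so it can be
        # cloned and debugged later without delaying the next attempt.
        quarantine(snapshot)
    return None


# Usage with stand-in callables: pretend "snap-a" gets stuck during recovery.
quarantined: List[str] = []
healthy = recover_with_quarantine(
    ["snap-a", "snap-b", "snap-c"],
    try_recover=lambda s: s != "snap-a",
    quarantine=quarantined.append,
)
```

The point of the sketch is only the ordering: the failed snapshot is set aside rather than deleted, and the retry proceeds immediately, so the investigation can happen later on a clone.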
Edited by Gerardo Lopez-Fernandez