Corrective action: The Patroni snapshot script should not over-run itself
Summary
During the incident, we discovered that three instances of the GCS snapshot script were running at the same time. This may have contributed to the degraded performance of the VM, which fell behind in replication.
Related Incident(s)
Originating issue(s): production#7250 (closed)
Desired Outcome/Acceptance Criteria
We should update this script so that multiple copies cannot run at once, or wrap it in a time limit so that the snapshot process does not impact the postgres service.
We could use something like flock to detect a still-running process and avoid starting a new one. Another approach would be to interrupt or stop the process if it has run too long. The second option may be preferable because it would limit the amount of time the postgres service spends in backup_start mode, but it may be more complicated to implement.
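The two options above can also be combined. The following is a minimal sketch, not the actual snapshot script: it assumes the script could be launched through a small Python wrapper on a Linux host, and the lock path, script path, and time limits are hypothetical placeholders. It takes a non-blocking flock on a lock file, skips the run if another instance still holds the lock, and stops the child if it exceeds the time limit.

```python
import fcntl
import subprocess

# All three values below are hypothetical and only illustrate the idea.
LOCK_PATH = "/var/lock/gcs-snapshot.lock"
SNAPSHOT_CMD = ["/usr/local/bin/gcs-snapshot.sh"]
TIMEOUT_SECONDS = 3600


def run_exclusive(cmd, lock_path, timeout, grace=10):
    """Run cmd unless another instance holds the lock; stop it after timeout.

    Returns ("skipped", None) if the lock is already held,
    ("timed_out", None) if the command had to be stopped,
    or ("finished", exit_code) otherwise.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return ("skipped", None)

        proc = subprocess.Popen(cmd)
        try:
            return ("finished", proc.wait(timeout=timeout))
        except subprocess.TimeoutExpired:
            # Ask politely first so the script can run its own cleanup,
            # then force-kill if it ignores SIGTERM for `grace` seconds.
            proc.terminate()
            try:
                proc.wait(timeout=grace)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
            return ("timed_out", None)
        # The lock is released when lock_file is closed on leaving the block.
```

A scheduled job could call run_exclusive(SNAPSHOT_CMD, LOCK_PATH, TIMEOUT_SECONDS) and alert on a "skipped" or "timed_out" result. Two caveats: this only signals the direct child, not any processes it spawns, and a real implementation would need to make sure postgres is taken out of backup mode when the script is interrupted.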
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'priority::4')