Firedrill - Patroni replicas unhealthy
Scenario
In this scenario we want to cover the actions necessary to mitigate any generic incident where one or more replicas are considered unhealthy.
We should cover the following tasks:
- What existing evidence/metrics indicate that a replica is unhealthy
- How to figure out which nodes are affected
- How to mitigate the issue to reduce customer impact
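As a possible starting point for the first two tasks, here is a minimal sketch that flags unhealthy replicas from the JSON returned by Patroni's REST API `GET /cluster` endpoint. The lag threshold, the set of states treated as healthy, and the function name `unhealthy_replicas` are assumptions made for illustration, not values agreed for this drill or taken from the existing alerting.

```python
from typing import Any

# Hypothetical threshold (100 MiB of replication lag); adjust to whatever the drill agrees on.
MAX_LAG_BYTES = 100 * 1024 * 1024
# Assumption: replica states reported by Patroni that we treat as healthy.
HEALTHY_STATES = {"running", "streaming"}
# Roles that are not replicas and are therefore skipped ("master" covers older Patroni versions).
NON_REPLICA_ROLES = {"leader", "master", "standby_leader"}


def unhealthy_replicas(
    cluster: dict[str, Any], max_lag_bytes: int = MAX_LAG_BYTES
) -> list[tuple[str, str]]:
    """Return (member name, reason) for each replica that looks unhealthy.

    `cluster` is the parsed JSON body of Patroni's `GET /cluster` response.
    """
    findings = []
    for member in cluster.get("members", []):
        if member.get("role") in NON_REPLICA_ROLES:
            continue
        name = member.get("name", "<unknown>")
        state = member.get("state")
        lag = member.get("lag")
        if state not in HEALTHY_STATES:
            findings.append((name, f"state={state}"))
        elif not isinstance(lag, int):
            # Patroni reports lag as "unknown" when it cannot be determined.
            findings.append((name, f"lag={lag}"))
        elif lag > max_lag_bytes:
            findings.append((name, f"lag={lag} bytes"))
    return findings
```

The function takes an already-parsed response so it can be exercised with sample data during the drill without touching a live cluster.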
Example of related incident: production#6253 (closed)
Meeting Format
There should be two sessions, one in EMEA and one in AMER.
As discussed in the 2022-02-15 Incident Review, we'll perform the fire drill first in order to figure out what needs to be in the runbook, rather than writing the runbook and then running a fire drill against it.
The notes taken in both sessions should contain the details of what the runbook must cover for this kind of incident.
- Moderators: TBD in EMEA, TBD in AMER
- Note Taker: TBD
Acceptance Criteria
- Google Doc created: https://docs.google.com/document/d/1vAyB6nFjgTx7GokmEDkInVMV24INCw11z8Se0qU3KFw/edit?usp=sharing
- Meeting scheduled; Agenda should include
  - Google Doc
  - Link to Scenario
- Meeting must be recorded
- Recording is uploaded to YouTube; apply the video to the following playlists:
  - Infrastructure Fire Drills
  - Infrastructure Group
- Mark the video as private if any Yellow and above classified data is shared
- Review the Google Doc and/or the Video for any potential follow up issues that need to be resolved
Edited by Rafael Henchen