Create a runbook on how to replace an unhealthy replica in a Patroni cluster
Why is this change being made?
We want to improve our response time and time to recover when a Patroni node is unhealthy.
As a first step, I am creating this runbook to cover how to remove any given replica from the Patroni cluster and, optionally, replace it with a new node.
Note: this is not the same as the scale-down process, which only removes the last node.
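To illustrate the difference from scale-down, here is a minimal sketch of how the node to remove might be picked by health rather than by position. It assumes the cluster state has the JSON shape returned by the Patroni REST API `GET /cluster` endpoint. The member names, the sample payload, and the rule that any state other than `running` or `streaming` counts as unhealthy are assumptions for illustration, not part of the runbook.

```python
import json

# Assumption: a replica is healthy only if Patroni reports one of these
# states ("streaming" in newer releases, "running" in older ones).
HEALTHY_STATES = {"running", "streaming"}

# Roles that are treated as replicas in this sketch.
REPLICA_ROLES = {"replica", "sync_standby"}


def find_unhealthy_replicas(cluster):
    """Return the names of replica members whose state is not healthy."""
    return [
        member["name"]
        for member in cluster.get("members", [])
        if member.get("role") in REPLICA_ROLES
        and member.get("state") not in HEALTHY_STATES
    ]


# Hypothetical payload; the member names are made up for this example.
SAMPLE = json.loads("""
{
  "members": [
    {"name": "patroni-01", "role": "leader",  "state": "running"},
    {"name": "patroni-02", "role": "replica", "state": "start failed"},
    {"name": "patroni-03", "role": "replica", "state": "streaming"}
  ]
}
""")

if __name__ == "__main__":
    # The unhealthy replica here is a middle node, not the last one.
    print(find_unhealthy_replicas(SAMPLE))
```

In this sample the unhealthy replica is a middle node, which the scale-down process would not remove.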
Action item discussed at the firedrill: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/15235
What should the runbook cover?
The steps should cover the firedrill sessions documented at https://docs.google.com/document/d/1vAyB6nFjgTx7GokmEDkInVMV24INCw11z8Se0qU3KFw/edit#
It can be based on https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/scale-down-patroni.md
Draft
The runbook is being drafted at https://gitlab.com/gitlab-com/runbooks/-/blob/rhenchen-update-backup-runbooks-2022-02-25/docs/patroni/unhealthy_patroni_node_handling.md