Praefect replication failures
Praefect's replications are error-prone tasks that are at the mercy of the distributed system gods. Proper handling of replication failures is crucial to Gitaly HA objectives.
Previously, we incorrectly shut down Praefect when replication failed (#2136 (closed)). Soon, we will simply log the replication failure (!1586 (merged)). However, logging isn't enough. A failure to replicate to a secondary implies that the secondary is possibly out of sync with the primary. Based on this, Praefect should take action, which could be one or more of the following:
- Alert operations that the secondary repo is out of sync (handled via logging in !1586 (merged))
- Reattempt the replication with backoffs
- Mark the replication job as failed in the datastore
- Reattempt replication with a different secondary
- In the datastore, mark the repo's replica as degraded, or indicate that the redundancy for the repo is at a lower value (e.g. 1 redundant copy vs 2)
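
To make the options above concrete, here is a minimal Go sketch of how the retry, fallback, and bookkeeping steps could fit together. It is not Praefect's actual code: `Replicator`, `Result`, `JobState`, `ErrUnreachable`, `replicateWithFallback`, and the storage names `gitaly-2` and `gitaly-3` are all hypothetical, and the in-memory `Result` stands in for whatever the datastore would record. Alerting is left out because logging already covers it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// JobState is a hypothetical replication job status kept in the datastore.
type JobState string

const (
	JobCompleted JobState = "completed"
	JobFailed    JobState = "failed"
)

// ErrUnreachable is a hypothetical error used to simulate a failing secondary.
var ErrUnreachable = errors.New("secondary unreachable")

// Replicator is a hypothetical hook that replicates a repo to one secondary.
type Replicator func(secondary string) error

// Result is what a caller would persist to the datastore (assumed shape).
type Result struct {
	State    JobState
	Degraded []string // secondaries that exhausted every attempt
	Synced   string   // secondary that ended up in sync, if any
}

// replicateWithFallback tries each secondary in order. Each secondary gets
// up to `attempts` tries with exponential backoff starting at `base`. A
// secondary that never succeeds is recorded as degraded and the next one is
// tried. A nil sleep function defaults to time.Sleep.
func replicateWithFallback(secondaries []string, attempts int, base time.Duration,
	sleep func(time.Duration), replicate Replicator) Result {
	if sleep == nil {
		sleep = time.Sleep
	}
	res := Result{State: JobFailed}
	for _, s := range secondaries {
		delay := base
		ok := false
		for i := 0; i < attempts; i++ {
			if err := replicate(s); err == nil {
				ok = true
				break
			}
			// Back off before the next try, but not after the last one.
			if i < attempts-1 {
				sleep(delay)
				delay *= 2
			}
		}
		if ok {
			res.State = JobCompleted
			res.Synced = s
			return res
		}
		res.Degraded = append(res.Degraded, s)
	}
	return res
}

func main() {
	// Simulate one secondary that always fails and one that works.
	replicate := func(secondary string) error {
		if secondary == "gitaly-2" {
			return ErrUnreachable
		}
		return nil
	}
	res := replicateWithFallback([]string{"gitaly-2", "gitaly-3"}, 3,
		10*time.Millisecond, nil, replicate)
	fmt.Printf("%+v\n", res)
}
```

In this sketch the job only ends up marked as failed once every candidate secondary has exhausted its attempts. The `Degraded` list is what would drive the "lower redundancy" indication for the repo. Whether a fallback secondary should be tried at all is a policy decision that this issue leaves open.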