Scope cut idea: Leverage Consul for failover
Problem
Currently the Gitaly team is working towards application-level failover for Praefect. The current plan is for Praefect to ping the Gitaly nodes for their status; when a node doesn't respond for a while, another Gitaly node is elected as primary. This is a Praefect alpha requirement, and it works for an alpha while there's only one Praefect.
Once we have to support multiple Praefects, we're opening a can of worms, as the number of scenarios to consider increases massively. Each Praefect might have a different view of the world and might fail over to a different new primary.
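To make the single-Praefect plan and the multi-Praefect problem concrete, here is a minimal Go sketch. It assumes a simplified model that is not Praefect's actual code: each Praefect counts consecutive failed pings per Gitaly node, treats a node as down once a threshold is reached, and promotes the first healthy node in a fixed preference order. `FailureDetector`, the threshold of 3, and the node names are all hypothetical.

```go
package main

import "fmt"

// FailureDetector is a hypothetical, simplified model of one Praefect's
// view of its Gitaly nodes. It is not Praefect's real implementation.
type FailureDetector struct {
	nodes     []string       // Gitaly nodes in fixed preference order (assumed non-empty)
	threshold int            // consecutive failed pings before a node counts as down
	misses    map[string]int // consecutive failed pings per node
	primary   string
}

func NewFailureDetector(nodes []string, threshold int) *FailureDetector {
	return &FailureDetector{
		nodes:     nodes,
		threshold: threshold,
		misses:    make(map[string]int),
		primary:   nodes[0],
	}
}

// Observe records the result of one ping to node and fails over if the
// current primary has just crossed the threshold.
func (d *FailureDetector) Observe(node string, ok bool) {
	if ok {
		d.misses[node] = 0
	} else {
		d.misses[node]++
	}
	if node == d.primary && !d.healthy(node) {
		d.failover()
	}
}

func (d *FailureDetector) healthy(node string) bool {
	return d.misses[node] < d.threshold
}

// failover promotes the first healthy node in preference order. If no node
// is healthy, the old primary is kept; a real system would need a policy here.
func (d *FailureDetector) failover() {
	for _, n := range d.nodes {
		if d.healthy(n) {
			d.primary = n
			return
		}
	}
}

func (d *FailureDetector) Primary() string { return d.primary }

func main() {
	nodes := []string{"gitaly-1", "gitaly-2", "gitaly-3"}
	a := NewFailureDetector(nodes, 3) // Praefect A
	b := NewFailureDetector(nodes, 3) // Praefect B

	// Both Praefects lose contact with gitaly-1. Praefect A additionally sees
	// a transient network problem to gitaly-2, so its view differs from B's.
	for i := 0; i < 3; i++ {
		a.Observe("gitaly-2", false)
		a.Observe("gitaly-1", false)
		b.Observe("gitaly-1", false)
	}
	fmt.Println(a.Primary(), b.Primary()) // prints "gitaly-3 gitaly-2"
}
```

With identical logic, Praefect A promotes `gitaly-3` while Praefect B promotes `gitaly-2`, purely because A also saw failed pings to `gitaly-2`. Reconciling such diverging views is the extra work that supporting multiple Praefects would add.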
This adds significant complexity to our project, while the gitlab.com architecture already includes a component we could leverage: Consul. Consul provides service discovery, that is, finding the Gitaly nodes that are available and healthy.
Pros
- Removes a lot of complexity from Praefect by shifting health checking and service discovery to Consul
- Consul exists to solve this problem
- Potentially saves many milestones of development and verification needed to reach a robust implementation
Cons
- Another component for our customers to install and operate
/cc @alejandro @jramsay @johncai @pokstad1 @jacobvosmaer-gitlab