Basic Gitaly node failure detection
Problem to solve
If a failure makes a Gitaly node unavailable, we should be able to detect this and mark the node as unavailable so that a new node can be promoted.
Further details
We should put our initial effort into the crossover process, that is, promoting a new node, rather than into failure detection itself. Support for detecting more complex failures is probably not needed for General Availability.
Proposal
As an administrator
- If I kill the Gitaly process on a Gitaly node so that it is unreachable, GitLab should notice that the node is no longer reachable.
- The node's reachability is exposed through Prometheus and through logging.
Links / references
Edited by Zeger-Jan van de Weg