Consider Consul as inventory storage for Praefect
The Gitaly HA design document currently states the following:
> 2. Introduce State
>
> The following details need to be persisted in Postgres:
>
> - Primary location for a project
> - Redundant locations for a project
> - Available storage locations (initially can be configuration file)
>
> Initially, the state of the shard nodes will be static and loaded from a configuration file. Eventually, this will be made dynamic via a data store (Postgres).
Handling an inventory of available nodes in a redundant setup is exactly the use case Consul is aimed at, so I want to propose it here. It has several advantages over Postgres for this use case, of which I'd like to point out:
- Integrated health check support: A health check (including gRPC health checks) can be defined to determine when a storage location becomes unhealthy and a failover needs to happen.
- Dynamic handling of nodes joining/leaving a cluster: with the Postgres approach we'd need a mechanism to keep the database up to date when a storage location comes online or fails. Consul's gossip protocol instead allows nodes to join or leave a cluster without any additional operation on a data store, which facilitates scaling up or down and failovers.
- Out-of-the-box consensus and geo capabilities: using the Raft protocol, Consul can achieve consensus between redundant servers, and is ready for a possible future setup spanning multiple datacenters. We could achieve something similar in Postgres, possibly by using Patroni and Consul, but why do that if we can skip a step? ;)
- Gitaly nodes are already part of a Consul cluster! As part of the node-inventory work for our deploy process, we already have a Gitaly service registered in our Consul cluster, so the information the design document requires already exists; we just need to tap into it :)
- (of lesser, but not nil, importance) The infra team is already familiar with managing Consul: we already use it for HA in our Patroni and HAProxy setups. With Postgres we'd have to develop an ad-hoc solution instead.
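As a concrete sketch of the health-check point above, here is a hypothetical Consul agent service definition registering a Gitaly node with a native gRPC health check. The service name, address, port, and interval are illustrative assumptions, not necessarily what our deploy currently registers:

```json
{
  "service": {
    "name": "gitaly",
    "address": "10.0.0.5",
    "port": 8075,
    "check": {
      "name": "gitaly gRPC health",
      "grpc": "10.0.0.5:8075",
      "grpc_use_tls": false,
      "interval": "10s"
    }
  }
}
```

With checks like this in place, Praefect could discover the currently healthy storage nodes with a single call to Consul's health API (e.g. `GET /v1/health/service/gitaly?passing=true`) instead of us maintaining that state in Postgres ourselves.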
/cc @gl-gitaly @dawsmith