SQL for failover and leader election
Problem to solve
Multiple Praefect proxies/routers must be able to run at the same time so that there is no single point of failure. They must all choose the same primary node; otherwise, different Praefects will route requests inconsistently, causing data loss.
Further details
Consul has been investigated for GitLab Omnibus in #2037 (closed) and #2458 (closed), since this is a good application of Consul. However, Consul is unlikely to be ideal in a Kubernetes environment.
Since we already have a PostgreSQL database as part of the current architecture, and it seems like a viable first iteration (see spike !1883 (closed)), we will use it to provide failover and leader election. We plan to keep open the possibility of supporting Consul and Kubernetes features for these problems.
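To make the idea concrete, here is a minimal sketch of one way a shared SQL row could keep several Praefects agreed on a single primary. It is not the approach from the spike. It assumes a lease-style election, and the table `primary_lease`, the function `elect_primary`, the node names, and the TTL are all hypothetical. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server; a real PostgreSQL version would need its own parameter syntax and timestamp types.

```python
import sqlite3

# Hypothetical schema: one row per shard, naming the current primary and
# when its lease runs out.
SCHEMA = """
CREATE TABLE IF NOT EXISTS primary_lease (
    shard_name TEXT PRIMARY KEY,
    node_name  TEXT NOT NULL,
    expires_at REAL NOT NULL
)
"""

# A single atomic upsert. The candidate becomes primary only if there is no
# row yet, the existing lease has expired, or the candidate already is the
# primary (in which case the lease is renewed). Otherwise nothing changes.
ELECT = """
INSERT INTO primary_lease (shard_name, node_name, expires_at)
VALUES (:shard, :candidate, :now + :ttl)
ON CONFLICT (shard_name) DO UPDATE
SET node_name  = excluded.node_name,
    expires_at = excluded.expires_at
WHERE primary_lease.expires_at <= :now
   OR primary_lease.node_name = excluded.node_name
"""


def elect_primary(conn, shard, candidate, now, ttl=10.0):
    """Propose `candidate` as primary for `shard` and return the agreed primary.

    `now` is passed in explicitly so the sketch is deterministic; a real
    deployment would more likely rely on the database clock.
    """
    with conn:  # commit on success, roll back on error
        conn.execute(
            ELECT,
            {"shard": shard, "candidate": candidate, "now": now, "ttl": ttl},
        )
    row = conn.execute(
        "SELECT node_name FROM primary_lease WHERE shard_name = ?", (shard,)
    ).fetchone()
    return row[0]


def demo():
    """Simulate two Praefects proposing different nodes over time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return [
        # First proposal: no row exists, so gitaly-1 takes the lease until t=10.
        elect_primary(conn, "default", "gitaly-1", now=0.0),
        # Another Praefect proposes gitaly-2 while the lease is still valid.
        elect_primary(conn, "default", "gitaly-2", now=1.0),
        # Nobody renewed the lease; after expiry the new candidate takes over.
        elect_primary(conn, "default", "gitaly-2", now=20.0),
    ]
```

In this sketch every Praefect reads its routing decision back from the same row, so two Praefects proposing different nodes still end up with the same answer. Failover happens when the lease is no longer renewed and expires.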
Proposal
- We do some lightweight refactoring to prepare the Praefect code to support both SQL and Consul.
- We aim to incorporate the SQL PoC approach first and roll out two Praefect nodes on GitLab.com.
- We focus on collecting failover metrics and similar data so that we have more information on how this performs.