Smarter Praefect election process
Problem to solve
Right now our primary election process in Praefect treats all Gitaly nodes that are up as possible candidates for election. However, what happens if a secondary fails replication or becomes out of sync? We need to consider marking the Gitaly node as "unelectable" and taking it out of the rotation.
Further Details
We probably need to consider a number of inputs:
- Are there any jobs left in the replication queue? If there is a node that is up that has an empty queue, we'd prefer that one over another one.
- Failure rate of replication jobs: If a node is unable to sync with the current primary for a number of repositories, that indicates something has gone out of sync, and we may need to do a repair.
Ideas for consideration:
- During the election process, sort nodes by ascending number of outstanding jobs, so that the node with the fewest outstanding jobs is considered first.
- Track a metric of the FetchInternalRemote and/or ReplicateRepository RPCs in Praefect
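
To make the failure-rate input more concrete, here is a minimal sketch of how replication outcomes could be counted per node and turned into an "unelectable" decision. Everything in it is an assumption made for illustration: the replicationStats type, its methods, and the maxFailureRate and minAttempts parameters are hypothetical and do not exist in Praefect. A real implementation would more likely use a time window or a Prometheus metric than a plain counter.

```go
package main

// replicationStats counts the outcomes of replication attempts for a single
// Gitaly node. Hypothetical type, not part of Praefect.
type replicationStats struct {
	attempts int
	failures int
}

// record registers the outcome of one replication attempt.
func (s *replicationStats) record(succeeded bool) {
	s.attempts++
	if !succeeded {
		s.failures++
	}
}

// failureRate returns failures divided by attempts, or 0 if nothing has been
// attempted yet.
func (s *replicationStats) failureRate() float64 {
	if s.attempts == 0 {
		return 0
	}
	return float64(s.failures) / float64(s.attempts)
}

// electable reports whether the node may take part in an election. A node
// with fewer than minAttempts recorded attempts is given the benefit of the
// doubt; otherwise its failure rate must not exceed maxFailureRate.
func (s *replicationStats) electable(maxFailureRate float64, minAttempts int) bool {
	if s.attempts < minAttempts {
		return true
	}
	return s.failureRate() <= maxFailureRate
}
```

With a threshold of 0.5 and a minimum of 4 attempts, for example, a node that failed 3 of its last 4 replication attempts would be taken out of the rotation, while a node with only 2 recorded attempts would still be considered.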
Proposal
When promoting a secondary to be the new primary, choose the Gitaly node with the fewest pending replication jobs.
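
The selection rule could look roughly like the sketch below. It is not Praefect's actual election code: the node type, its fields, and the electPrimary function are hypothetical names chosen for this example, and it assumes the pending job count per node is already known (for instance, read from the replication queue).

```go
package main

// node is a hypothetical view of a Gitaly node during an election.
type node struct {
	name        string
	healthy     bool // node is up and passing health checks
	pendingJobs int  // replication jobs still queued for this node
}

// electPrimary returns the healthy candidate with the fewest pending
// replication jobs. The boolean is false if no healthy candidate exists.
// On a tie, the candidate that appears first in the slice wins.
func electPrimary(candidates []node) (node, bool) {
	var best node
	found := false
	for _, n := range candidates {
		if !n.healthy {
			continue // nodes that are down are never candidates
		}
		if !found || n.pendingJobs < best.pendingJobs {
			best = n
			found = true
		}
	}
	return best, found
}
```

A single pass is enough here because only the best candidate is needed; sorting the whole list would matter only if a ranked fallback order were required.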
Links / references
Edited by James Ramsay (ex-GitLab)