Smarter Praefect election process

Problem to solve

Right now, the primary election process in Praefect treats every Gitaly node that is up as a candidate. However, a secondary that has failed replication or fallen out of sync can be up and still hold stale data. We need to consider marking such a Gitaly node as "unelectable" and taking it out of the rotation.

Further Details

We probably need to consider a number of inputs:

  1. Are there any jobs left in the replication queue? If a node is up and has an empty queue, we should prefer it over nodes that still have pending jobs.
  2. Failure rates of the replication queue: If a node is unable to sync with the current primary for a number of repositories, that indicates the node has gone out of sync, and we may need to do a repair.

Ideas for consideration:

  1. During the election process, sort nodes by ascending number of outstanding jobs, so that the node with the fewest pending jobs is considered first.
  2. Track a failure-rate metric for the FetchInternalRemote and/or ReplicateRepository RPCs in Praefect.
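
To make the second idea concrete, here is a minimal sketch of per-node failure tracking. Everything in it is hypothetical and not taken from the Praefect codebase: the `tracker` type, the `record` and `electable` methods, the `errReplication` value, and the idea of a fixed `maxFailureRate` threshold are assumptions made for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errReplication is a placeholder error standing in for a failed
// FetchInternalRemote or ReplicateRepository call (hypothetical).
var errReplication = errors.New("replication failed")

// replicationStats counts replication attempts and failures for one node.
type replicationStats struct {
	attempts int
	failures int
}

// failureRate returns the fraction of attempts that failed,
// or 0 when no attempt has been recorded yet.
func (s replicationStats) failureRate() float64 {
	if s.attempts == 0 {
		return 0
	}
	return float64(s.failures) / float64(s.attempts)
}

// tracker keeps replication stats per node name (hypothetical type).
type tracker struct {
	mu    sync.Mutex
	stats map[string]replicationStats
}

func newTracker() *tracker {
	return &tracker{stats: make(map[string]replicationStats)}
}

// record notes the outcome of one replication attempt against a node.
func (t *tracker) record(node string, err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := t.stats[node]
	s.attempts++
	if err != nil {
		s.failures++
	}
	t.stats[node] = s
}

// electable reports whether the node's failure rate is at or below
// maxFailureRate. The threshold is an assumed tuning knob.
func (t *tracker) electable(node string, maxFailureRate float64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.stats[node].failureRate() <= maxFailureRate
}

func main() {
	tr := newTracker()
	tr.record("gitaly-2", nil)
	tr.record("gitaly-2", errReplication)
	tr.record("gitaly-3", nil)

	// gitaly-2 has failed half of its attempts; gitaly-3 none.
	fmt.Println(tr.electable("gitaly-2", 0.25))
	fmt.Println(tr.electable("gitaly-3", 0.25))
}
```

A lifetime counter like this never forgets old failures, so a real implementation would more likely use a sliding window or an existing metrics library. The sketch only shows how a failure rate could feed an "unelectable" decision.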

Proposal

When promoting a secondary to be the new primary, choose the healthy Gitaly node with the fewest pending replication jobs.
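
The proposal could be sketched roughly as follows. This is an illustration under stated assumptions, not Praefect's actual election code: the `node` struct, its `healthy` and `pendingJobs` fields, and the `electPrimary` function are hypothetical names, and the pending-job counts are assumed to have been read from the replication queue beforehand.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a hypothetical view of a Gitaly node at election time.
type node struct {
	name        string
	healthy     bool // node is up and passed its health check
	pendingJobs int  // replication jobs still queued for this node
}

// electPrimary returns the healthy node with the fewest pending
// replication jobs. The second return value is false when no node
// is healthy. Ties keep the input order because the sort is stable.
func electPrimary(nodes []node) (node, bool) {
	candidates := make([]node, 0, len(nodes))
	for _, n := range nodes {
		if n.healthy {
			candidates = append(candidates, n)
		}
	}
	if len(candidates) == 0 {
		return node{}, false
	}
	// Ascending order: fewest outstanding jobs first.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].pendingJobs < candidates[j].pendingJobs
	})
	return candidates[0], true
}

func main() {
	nodes := []node{
		{name: "gitaly-1", healthy: false, pendingJobs: 0}, // failed primary
		{name: "gitaly-2", healthy: true, pendingJobs: 12},
		{name: "gitaly-3", healthy: true, pendingJobs: 3},
	}
	if primary, ok := electPrimary(nodes); ok {
		fmt.Println("new primary:", primary.name) // prints "new primary: gitaly-3"
	}
}
```

Copying the healthy nodes into a separate slice before sorting leaves the caller's node list untouched, which matters if that order is used elsewhere.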

Links / references

Edited by James Ramsay (ex-GitLab)