Majority wins voting strategy (beta)

Problem to solve

Gitaly Cluster allows Git repositories to be replicated on multiple warm Gitaly nodes. This improves fault tolerance by removing single points of failure. However, because write operations are currently replicated asynchronously, the GitLab server initially has only one copy of the change. Transactional write operations to Git repositories, added in GitLab 13.2, can be enabled, but the current voting mechanism requires all nodes to agree. This means that if a single node fails, the write operation fails, creating a single point of failure.

A quorum-based voting strategy is a more reliable voting mechanism that requires only a majority of nodes to agree. When enabled, writes must still succeed on multiple nodes, but the system can tolerate a minority of nodes failing. The failed nodes can then be recovered asynchronously from the nodes that formed the quorum.

Further details

Consider a cluster with 3 nodes: 2 of the 3 nodes must agree to accept the write. This guarantees that a majority of the nodes have an up-to-date copy of the repository.
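
The 2-of-3 rule above can be illustrated with a minimal Python sketch. It is not Gitaly's implementation: the function name, the node names, and the modeling of each vote as an opaque string (with None for a node that failed or did not respond) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def tally_votes(votes: Dict[str, Optional[str]]) -> Tuple[bool, List[str]]:
    """Return (accepted, nodes_to_recover) for a single write.

    `votes` maps a hypothetical node name to the value it voted for,
    or None if the node failed or did not respond.
    """
    # Strict majority of all nodes taking part in the vote.
    needed = len(votes) // 2 + 1
    counts = Counter(v for v in votes.values() if v is not None)
    if not counts:
        return False, []
    winner, count = counts.most_common(1)[0]
    if count < needed:
        # No value reached a majority, so the write is rejected.
        return False, []
    # Nodes that failed or disagreed must be brought up to date later.
    outdated = sorted(node for node, v in votes.items() if v != winner)
    return True, outdated


accepted, to_recover = tally_votes(
    {"gitaly-1": "abc", "gitaly-2": "abc", "gitaly-3": None}
)
print(accepted, to_recover)
```

In this sketch, the write is accepted because two of three nodes cast the same vote, and the third node is reported as needing asynchronous recovery.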

Considerations:

  • If a node is unreachable due to an outage, how does the system proceed?
    • Maybe the number of nodes required to reach quorum could be calculated as minimum(n, (n DIV 2) + 1), where n is the number of reachable Gitaly nodes?
    • 1 reachable Gitaly node, 1 node to reach quorum
    • 2 reachable Gitaly nodes, 2 nodes to reach quorum (consensus)
    • 3 reachable Gitaly nodes, 2 nodes to reach quorum
    • 4 reachable Gitaly nodes, 3 nodes to reach quorum
    • 5 reachable Gitaly nodes, 3 nodes to reach quorum
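
The calculation floated in the list above can be checked with a short Python sketch. The function names are hypothetical, and the choice to reject every write when no node is reachable is an assumption, since the considerations do not cover that case.

```python
def quorum_size(reachable_nodes: int) -> int:
    """Nodes required to reach quorum: minimum(n, (n DIV 2) + 1)."""
    if reachable_nodes < 0:
        raise ValueError("reachable_nodes must be non-negative")
    return min(reachable_nodes, reachable_nodes // 2 + 1)


def write_accepted(agreeing_nodes: int, reachable_nodes: int) -> bool:
    """True if enough reachable nodes agreed on the write.

    Assumption: with zero reachable nodes, no write can be accepted.
    """
    if reachable_nodes <= 0:
        return False
    return agreeing_nodes >= quorum_size(reachable_nodes)


# Prints the same thresholds as the list above: 1, 2, 2, 3, 3.
for n in range(1, 6):
    print(f"{n} reachable Gitaly nodes, {quorum_size(n)} to reach quorum")
```

For any n of at least 1, (n DIV 2) + 1 never exceeds n, so the minimum only changes the result for n = 0, where it yields 0 rather than 1. That is why the sketch handles the zero-node case explicitly.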

Proposal

Implement support for a more robust quorum-based voting strategy.

Setting gitaly_reference_transactions_primary_wins to false will cause a quorum-based strategy to be used.

Edited by Zeger-Jan van de Weg