Majority wins voting strategy (beta)
Problem to solve
Gitaly Cluster allows Git repositories to be replicated on multiple warm Gitaly nodes. This improves fault tolerance by removing single points of failure. However, because write operations are currently replicated asynchronously, the GitLab server only has one copy of the change initially. Transactional write operations to Git repositories, added in GitLab 13.2, can be enabled, but the current voting mechanism requires all nodes to agree. This means that if a single node fails, the write operation fails, so every node becomes a single point of failure.
A quorum-based voting strategy is a more reliable voting mechanism that requires only a majority of nodes to agree. When enabled, writes must still succeed on multiple nodes, but the system can tolerate a minority of nodes failing. The failed nodes can then be recovered asynchronously from the nodes that formed the quorum.
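As a rough illustration of the idea, the sketch below tallies one vote per node, accepts the write if a strict majority of all nodes cast the same vote, and lists the remaining nodes as candidates for asynchronous recovery. The function name tally, the node names, and the use of a string as the vote value are assumptions made for this example, not Gitaly's actual implementation. An empty vote stands for a node that failed or did not respond.

```go
package main

import (
	"fmt"
	"sort"
)

// tally is a hypothetical helper. votes maps a node name to the vote it cast;
// an empty string means the node failed or did not vote.
// It returns the winning vote, the nodes that agreed on it, and the nodes
// that would need to be recovered. ok is false when no vote is shared by a
// strict majority of all nodes.
func tally(votes map[string]string) (winner string, agreed, outdated []string, ok bool) {
	counts := make(map[string]int)
	for _, v := range votes {
		if v != "" {
			counts[v]++
		}
	}
	// At most one vote can be held by more than half of the nodes.
	for v, c := range counts {
		if c > len(votes)/2 {
			winner, ok = v, true
			break
		}
	}
	if !ok {
		return "", nil, nil, false
	}
	for node, v := range votes {
		if v == winner {
			agreed = append(agreed, node)
		} else {
			outdated = append(outdated, node)
		}
	}
	// Sort so the result does not depend on map iteration order.
	sort.Strings(agreed)
	sort.Strings(outdated)
	return winner, agreed, outdated, true
}

func main() {
	// Three nodes; gitaly-3 failed and cast no vote.
	votes := map[string]string{
		"gitaly-1": "abc123",
		"gitaly-2": "abc123",
		"gitaly-3": "",
	}
	winner, agreed, outdated, ok := tally(votes)
	fmt.Println(ok, winner, agreed, outdated)
}
```

With an all-nodes-must-agree strategy the same input would be rejected, because gitaly-3 did not vote.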
Further details
Considering a cluster with 3 nodes, 2 of the 3 nodes must agree to accept the write. This guarantees that a majority of the nodes have an up-to-date copy of the repository.
Considerations:
- if a node is unreachable due to an outage, how does the system proceed?
  - maybe the number of nodes required to reach quorum is calculated as minimum(n, (n DIV 2) + 1), where n is the number of reachable Gitaly nodes?
    - 1 reachable Gitaly node, 1 node to reach quorum
    - 2 reachable Gitaly nodes, 2 nodes to reach quorum (consensus)
    - 3 reachable Gitaly nodes, 2 nodes to reach quorum
    - 4 reachable Gitaly nodes, 3 nodes to reach quorum
    - 5 reachable Gitaly nodes, 3 nodes to reach quorum
Proposal
Implement support for a more robust quorum-based voting strategy.
When set to false, gitaly_reference_transactions_primary_wins will cause a quorum-based strategy to be used.