Gitaly Cluster: strong consistency
<!-- triage-serverless v2 PLEASE DO NOT REMOVE THIS SECTION -->
*The following page may contain information related to upcoming products, features and functionality.
It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.*
<!-- triage-serverless v2 PLEASE DO NOT REMOVE THIS SECTION -->
### Release notes
Gitaly Cluster allows Git repositories to be replicated on multiple warm Gitaly nodes. This improves fault tolerance by removing single points of failure. [Reference transactions](#gitaly-cluster-reference-transactions), introduced in GitLab 13.3, cause changes to be broadcast to all the Gitaly nodes in the cluster, but only the Gitaly nodes that vote in agreement with the primary node persist the changes to disk. If all the replica nodes dissent, only one copy of the change is persisted to disk, creating a single point of failure until asynchronous replication completes.
Quorum-based voting improves fault tolerance by requiring a majority of nodes to agree before persisting changes to disk. When the feature flag is enabled, writes must succeed on multiple nodes. Dissenting nodes are automatically brought in sync by asynchronous replication from the nodes that formed the quorum.
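The quorum rule described above can be sketched roughly as follows. This is a minimal illustration, not Gitaly's or Praefect's actual code: the `Vote` type, the `hashUpdates` and `tally` functions, the node names, and the choice of hashing the reference updates with SHA-1 are all assumptions made for the example.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// Vote is a hypothetical record of what one node intends to write:
// a hash over the reference updates it received.
type Vote struct {
	Node string
	Hash [sha1.Size]byte
}

// hashUpdates derives a vote from a textual list of reference updates.
// (Assumption: the real vote format may differ.)
func hashUpdates(updates string) [sha1.Size]byte {
	return sha1.Sum([]byte(updates))
}

// tally reports whether a strict majority of nodes cast the same vote.
// If so, it returns the nodes that formed the quorum and the nodes that
// dissented and would need asynchronous replication afterwards.
func tally(votes []Vote) (committed bool, agreed, outOfSync []string) {
	counts := make(map[[sha1.Size]byte]int)
	for _, v := range votes {
		counts[v.Hash]++
	}

	// At most one hash can reach a strict majority.
	threshold := len(votes)/2 + 1
	var winner [sha1.Size]byte
	for h, n := range counts {
		if n >= threshold {
			winner, committed = h, true
		}
	}
	if !committed {
		// No quorum: nothing is persisted on any node.
		return false, nil, nil
	}

	for _, v := range votes {
		if v.Hash == winner {
			agreed = append(agreed, v.Node)
		} else {
			outOfSync = append(outOfSync, v.Node)
		}
	}
	return committed, agreed, outOfSync
}

func main() {
	good := hashUpdates("refs/heads/main 0000 1111")
	bad := hashUpdates("refs/heads/main 0000 2222")

	votes := []Vote{
		{Node: "gitaly-1", Hash: good},
		{Node: "gitaly-2", Hash: good},
		{Node: "gitaly-3", Hash: bad},
	}

	committed, agreed, outOfSync := tally(votes)
	fmt.Println(committed, agreed, outOfSync)
	// prints: true [gitaly-1 gitaly-2] [gitaly-3]
}
```

In this sketch, two of three nodes agree, so the write would be acknowledged, and `gitaly-3` is the node that would later be brought back in sync by replication.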
Documentation: https://docs.gitlab.com/ee/administration/gitaly/praefect.html#strong-consistency
### Problem to solve
When a user pushes changes to GitLab and we accept them, the changes should be persisted on a sufficient number of replicas before we communicate success to the client. Otherwise, data can be lost unexpectedly if the primary fails before the write has been replicated to enough nodes.
### Further details
Customers using NFS for HA expect a similarly consistent solution, not an eventually consistent backup.
<details><summary>Possible approaches:</summary>
- :x: Fully Praefect managed https://gitlab.com/gitlab-org/gitaly/-/merge_requests/1863
  - Pro: Single code base changes, Migration can be done piece by piece
  - Con: Migration is done piece by piece, Complexity for Gitaly, as well as Praefect
- :x: Git update-ref DSL changes
  - Mailing list changes: https://lore.kernel.org/git/cover.1585129842.git.ps@pks.im/, https://lore.kernel.org/git/cover.1585811013.git.ps@pks.im/
  - Pro: Can be upstreamed, Migration can be done piece by piece
  - Con: Requires all RPCs to use `git update-ref` to make use of it, Migration is done piece by piece
- :white_check_mark: Git hooks based, introduce new hooks https://gitlab.com/gitlab-org/gitlab-git/-/tree/pks-ref-transaction-hooks https://gitlab.com/gitlab-org/gitaly/-/issues/2529#note_312712916
  - Pro: Can be upstreamed, Catches all known Git commands that do ref updates, Can be implemented mostly transparently to the caller of a given command
  - Con: Commands doing multiple reference updates will have to do multiple votes
- :x: Update ref.c with GitLab specific logic
  - Pro: No dependency on hooks
  - Con: Maintaining a patch, reapplying each Git version, Harder to ship throughout GitLab (CNG, GDK, omnibus, etc)
Intention: **hooks** are the preferred approach. Upstream the new hooks once we've built an MVC and improved it.
</details>
### Proposal
- [x] **Iteration 1:** 2PC pre-receive hook proof of concept https://gitlab.com/gitlab-org/gitaly/-/issues/2635 %"13.0"
- [x] **Iteration 2:** 2PC hooks MVC %13.1
Iterate on the proof of concept pre-receive hook approach to include **all write operations** by creating new hooks for ref updates, implement **proxying** of writes, and monitoring.
- [x] **Iteration 3:** Improvements and Performance %13.2
Possibly stream-wise proxying and faster checksums.
- [x] **Iteration 4:** Generally available %13.3
### Links / references
epic