Implement a Raft-based decentralized architecture for Gitaly Cluster
The Gitaly team has been [discussing](https://gitlab.com/groups/gitlab-org/-/epics/8175) moving Gitaly Cluster to a [Raft-based replication architecture](https://en.wikipedia.org/wiki/Raft_(algorithm)) to solve the inconsistency issues affecting Gitaly Cluster. Through the discussion, it became clear that we can go a lot further and solve a large number of other problems as well by making more architectural changes enabled by the replication changes.

We aim to:

1. Solve the variety of inconsistency issues Gitaly Cluster has.
2. Remove Praefect.
3. Remove Postgres.
4. Through an upgrade, make every Gitaly a cluster of one.

The initial design document is available as a [Google Doc](https://docs.google.com/document/d/13dTh0AGCHjM9BSf80koqtUSLELUiZfLd7NWrx7m6NOE).

# Status

## 2024-06-18

- [Gitaly's transaction support](https://docs.gitlab.com/ee/architecture/blueprints/gitaly_transaction_management/) is the first major deliverable coming out of this project. Raft builds upon transactions.
- We've deployed transactions on staging. Things are generally stable and we're fixing small issues here and there.
- We're now focusing on preparing transactions for production deployment in &13306+.
- In &10328+, we've implemented a new approach to zero-downtime upgrades in Gitaly. The new approach relies on client-side retries to bridge over the restart process. This greatly simplifies the upgrade process and allows for having only a single Gitaly process operating on a storage, as required by transactions. The new approach is deployed on staging and is currently being rolled out to production.
- With transactions maturing, we've started work on a proof-of-concept Raft implementation in https://gitlab.com/groups/gitlab-org/-/epics/13562+.

## 2024-03-28

- We're currently focusing on getting transactions deployed on staging in https://gitlab.com/groups/gitlab-org/-/epics/13304+.
- We've continued maturing the transaction implementation, including switching over to a simpler and more flexible logging protocol in https://gitlab.com/gitlab-org/gitaly/-/issues/5793+. This enables us to support a wider variety of writes without adding more complexity to the logging protocol.
- We hit some blockers in Git that had to be solved upstream. The blockers have been addressed and the fixes are included in Git v2.45. We're in the process of finishing the blocked work in Gitaly in https://gitlab.com/gitlab-org/gitaly/-/issues/5770. We're currently running a patched version of Git v2.44 that contains these changes, so we're not blocked on the release of Git v2.45.
- Housekeeping has recently been integrated with transactions in https://gitlab.com/gitlab-org/gitaly/-/issues/5733+. This was the last major piece of functionality missing from the transaction implementation.
- Before deploying transactions to staging, we'd like to have them exercised in Rails specs and QA tests in https://gitlab.com/gitlab-org/gitaly/-/issues/5664+. We've hit an issue there due to https://gitlab.com/gitlab-org/gitaly/-/issues/5887+ and are currently focusing on fixing it.
- Our goal is to deploy transactions on staging by the end of April 2024. This is just a target and may shift if further unexpected issues pop up. We're gathering issues blocking the staging deployment in https://gitlab.com/groups/gitlab-org/-/epics/13304+.
- Once we're on staging, we'll start preparing for production deployment. We're gathering issues blocking production deployment in https://gitlab.com/groups/gitlab-org/-/epics/13306+.
- Transactions will be the first feature released to customers out of this epic. They are the foundation the new cluster architecture is built on.

## 2023-10-26

- Since the last update, we've kept making progress on each of the topics in progress.
- We've merged further changes to the core transaction logic and are now exercising transactions in Gitaly's CI, with almost all tests passing. We're working on merging support for object pools, as that's the final larger topic that still isn't passing RPC tests.
- Following https://gitlab.com/gitlab-org/gitaly/-/merge_requests/6496+, Gitaly can be run with transactions enabled for development. We can then begin running [other test suites](https://gitlab.com/gitlab-org/gitaly/-/issues/5664) with transactions enabled as well.

## 2023-09-21

Since the last update:

- We've come up with [a design for ACID transactions](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/122028) in Gitaly. Transactions:
  - bring major reliability benefits by eliminating entire classes of problems related to concurrency and interrupted writes.
  - enable functionality that was previously not possible, such as online consistent backups.
  - enable performance improvements. For example, pushes will scale by the size of the push rather than the size of the repository.

On top of the above benefits, the new transaction design made it possible to transparently integrate transactions with the existing code in Gitaly. This has significantly sped up the integration effort.

Raft replicates the write-ahead log. The [write-ahead log](https://gitlab.com/groups/gitlab-org/-/epics/8911) is part of the implementation of transactions. We're currently focusing on finishing the implementation and deploying transactions.

The implementation is now far along and currently:

- Is integrated with all RPCs except object pool related ones.
- Runs RPCs with snapshot isolation.
- Supports all write types except object pool related ones.
- Write-ahead logs all writes.
- Supports recovering from the write-ahead log after crashes.

The core transaction implementation is still missing:

- Support for object pool links/disconnects.
- Housekeeping logic required to keep repositories in good shape by repacking objects and references and building indexes.
- Log pruning logic to remove write-ahead log entries once they are no longer needed.
- Serializability checks to prevent concurrent transactions from performing conflicting changes.
  - Reference updates are already conflict checked.
  - Repository creations, deletions, custom hook updates and default branch updates are not.
- Logic to hold on to dependencies of logged pack files.

Concurrently with the work on the core transaction logic, we're:

- Upstreaming changes in Git to figure out dependencies of new objects (https://gitlab.com/groups/gitlab-org/-/epics/11242). This enables us to figure out the objects the write-ahead logged pack files depend on.
- Removing legacy code from Rails that uses functionality Gitaly no longer supports with transactions:
  - Rugged, direct access from Rails to repositories.
  - `RenameRepository` RPC, and functionality around it.
  - `NamespaceService`'s usage, and functionality around it.
  - `SetFullPath`/`GetFullPath` RPCs and the functionality around them.
- Removing reliance on the `info/gitattributes` file (https://gitlab.com/groups/gitlab-org/-/epics/9006+).
- Working on a new deployment approach in https://gitlab.com/groups/gitlab-org/-/epics/10328+. The current one would have concurrent processes writing into the repositories, which is not supported with transactions.

## 2023-03-24

We're currently focusing on implementing write-ahead logging in Gitaly. The progress of the WAL implementation itself can be followed in its own epic at https://gitlab.com/groups/gitlab-org/-/epics/8911.

We've also been furthering some of the prerequisites for the architecture. These are great improvements on their own but also bring us closer to the new architecture:

- https://gitlab.com/gitlab-org/gitaly/-/issues/4629+ (cc @justintobler)
  - Custom hooks were previously written directly into the repository, circumventing Gitaly.
    This change implements a Gitaly RPC for setting hooks, and a CLI tool to call it. Since the changes now go through the API, we can write-ahead log them and replicate them when they change.
- https://gitlab.com/groups/gitlab-org/-/epics/8953+ (cc @proglottis)
  - This removes the last piece of data that is being written into the repository's Git config. This means we don't have to support writing to the Git config through the WAL and don't have to replicate it.
- https://gitlab.com/groups/gitlab-org/-/epics/9006+ (cc @knayakgl)
  - This enables Git to read attributes from a blob, which means we can read them from `HEAD` instead of needing to read them from a separate file on disk. This eases our WAL implementation, as we don't have to separately WAL the attributes file nor handle MVCC for it. Those aspects can be handled using the general handling for references and objects.
- https://gitlab.com/groups/gitlab-org/-/epics/8971+ (cc @qmnguyen0711)
  - We've been looking into client-side retry policies and load balancing in gRPC, and implemented [improvements](https://docs.gitlab.com/ee/administration/gitaly/praefect.html#service-discovery) in Praefect based on this work. The knowledge gathered here will help with implementing the client-side logic, and eventually the sidecar proxy in https://gitlab.com/groups/gitlab-org/-/epics/10170.
- https://gitlab.com/gitlab-org/gitaly/-/issues/3780+ (cc @pks-gitlab)
  - There's ongoing work on improving `git fetch`. It currently doesn't give us enough control to WAL the changes it's about to make. With the work that is being upstreamed, it will become possible to WAL the reference changes and the new objects it is about to write.

https://gitlab.com/groups/gitlab-org/-/epics/9009+ is a topic that we're looking to pick up soon. We'll look more into how to handle object pools better, but also into how they'll fit in with the WAL. (cc @pks-gitlab)

## 2022-10-14

The review of the design has been finished, thank you for the feedback!
We still welcome feedback and questions about the design.

With the review out of the way, we're moving on to verifying assumptions and implementing prerequisites and parts of the design that are not blocked by other work. This is the top-level epic that tracks tasks associated with this work.

We consider the write-ahead logging and replication performance the main risk in the design and seek to verify it early on, concurrently with the other work. This will initially be done with local tests and benchmarks. Once the results are good enough, we'll move on to verifying the performance in production.