Raft replication
#### Participants
- @qmnguyen0711
- @divya_gitlab
- @jamesliu-gitlab
- @echui-gitlab
## Milestones
- [x] Evaluate the feasibility of using an open-source Raft library. We picked etcd/raft after pivoting away from Dragonboat.
- [ ] Convert all partitions to single-node clusters
- [ ] Support adding read-only replicas to a partition
- [ ] Support adding electable replicas to a partition
## Goals and limitations
* Log replication works well in a static 3-node cluster
* Quiesce inactive Raft groups. This feature is crucial for the feasibility of Raft in production, so it should be part of the POC.
* Support cluster election and primary failover.
* Cluster is able to serve requests, but clients must route requests to the right storage.
* Observability (replica health, replication state) both for dashboarding and alerting.
* We'll need to keep the repository identity system that uses `storage` and `relative_path`. It's mostly because this system is deeply rooted inside Rails. This will be addressed in https://gitlab.com/gitlab-org/gitaly/-/issues/6104
* Support repositories created after the cluster is bootstrapped only. Existing repositories will be supported in https://gitlab.com/gitlab-org/gitaly/-/issues/6037 (Post-Raft).
## Not included
- Eventually consistent routing tables. Each storage knows its own partitions and replicated partitions only.
- Partition migration between storages.
- Full production readiness
- Content checksum and re-sync if the partition state diverges
- Stuff in https://gitlab.com/groups/gitlab-org/-/epics/10864+
## Execution plan

<!-- STATUS NOTE START -->
## Status 2025-04-08
Some work carried over from last week made progress. There is no major update this week. We are heading toward rolling out single-node clusters on Staging while working on replication in parallel.
## :tada: **achievements**:
* The MR to [add support for smooth Raft enablement](https://gitlab.com/gitlab-org/gitaly/-/merge_requests/7746) was merged. We can now turn the Raft single-node cluster on or off at any time for an existing partition. This is an escape hatch that lets us go back without data migration or disruption.
* The refactoring prerequisite (https://gitlab.com/gitlab-org/gitaly/-/merge_requests/7754) was merged. Integration of the persistent routing table ([gitlab-org/gitaly#6675](https://gitlab.com/gitlab-org/gitaly/-/issues/6675)) has resumed.
* https://gitlab.com/gitlab-org/gitaly/-/merge_requests/7647+ was merged after a long review process. That work paves the way for adding Raft replication.
## :issue-blocked: **blockers**:
* Snapshotting is blocked by https://gitlab.com/gitlab-org/gitaly/-/issues/6675 and https://gitlab.com/gitlab-org/gitaly/-/issues/6643. We cannot integrate and write tests for snapshotting if there's only one member in a Raft group.
  * Solution: split https://gitlab.com/gitlab-org/gitaly/-/issues/6643; we don't need full membership management at the moment. Extracting the ability to add replicas unlocks snapshotting.
* This is more a foreseen performance problem than a blocker. When working on https://gitlab.com/gitlab-org/gitaly/-/merge_requests/7778, we noticed that the Raft life cycle is tied to the life cycle of the partition. A partition is closed after being inactive for a while, so restarting that partition incurs latency overhead from the initial election.
  * Solution: This problem will be addressed by implementing the [quiescing feature](https://gitlab.com/gitlab-org/gitaly/-/issues/6035). That said, we might consider prioritizing it earlier if the overhead is prohibitively expensive.
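
To make the quiescing idea concrete, here is a minimal sketch of the general technique: an idle Raft group stops being driven (no heartbeats, no election timers) but keeps its state in memory, so waking it up does not require a new election. This is not the design tracked in the quiescing issue, and every name below (`quiescer`, `tick`, `activity`, the threshold) is hypothetical; none of them are Gitaly or etcd/raft APIs.

```go
package main

import "fmt"

// groupState is a hypothetical per-partition Raft group state for this sketch.
type groupState int

const (
	active groupState = iota
	quiesced
)

// quiescer counts idle ticks for one Raft group.
// Illustrative only: not part of Gitaly or etcd/raft.
type quiescer struct {
	idleTicks int
	threshold int
	state     groupState
}

func newQuiescer(threshold int) *quiescer {
	return &quiescer{threshold: threshold}
}

// tick is called once per logical clock tick. It reports whether the
// caller should still drive Raft (heartbeats, election timers) this tick.
func (q *quiescer) tick() bool {
	if q.state == quiesced {
		return false
	}
	q.idleTicks++
	if q.idleTicks >= q.threshold {
		// Idle for long enough: stop driving the group, keep its state.
		q.state = quiesced
		return false
	}
	return true
}

// activity records a proposal or inbound message and wakes the group.
func (q *quiescer) activity() {
	q.idleTicks = 0
	q.state = active
}

func main() {
	q := newQuiescer(3)
	for i := 1; i <= 4; i++ {
		fmt.Printf("tick %d: drive raft = %v\n", i, q.tick())
	}
	q.activity()
	fmt.Printf("after activity: drive raft = %v\n", q.tick())
}
```

In a real system all members of a group would have to agree on quiescing, otherwise a follower that keeps ticking would time out and start an election. The sketch leaves that coordination out.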
## :arrow_forward: **next**:
* Benchmark and test the performance of Raft single-node clusters.
* [Add replica placement support to Gitaly node manager](https://gitlab.com/gitlab-org/gitaly/-/issues/6647) is being picked up. That issue tracks the refactoring of the WAL directory structure to support hosting foreign replicas on the same node.
* [Add debugging utilities](https://gitlab.com/gitlab-org/gitaly/-/issues/6676) to support the Staging rollout and local development.
* Finish wiring the persistent routing table into the Raft manager ([gitlab-org/gitaly#6675](https://gitlab.com/gitlab-org/gitaly/-/issues/6675)) to enable membership tracking.
FYI @jamesliu-gitlab @echui-gitlab @divya_gitlab
_Copied from https://gitlab.com/groups/gitlab-org/-/epics/13562#note_2438858350_
<!-- STATUS NOTE END -->