# Gitaly Clusters variable replication factor
<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->
*This page may contain information related to upcoming products, features and functionality.
It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.*
<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->
## Problem to solve
It is difficult to scale Gitaly progressively to large amounts of Git data. There are multiple reasons to scale Gitaly: to increase storage capacity, to increase resources (CPU/memory), and to increase fault tolerance.
A number of features exist to help scale an instance, but they are cumbersome to manage. These include:
- shards: repository storages, shown in the Admin interface. A project and all its repositories must be stored on the same shard. This can be changed using the project API.
- clusters: a shard can be a cluster of Gitaly nodes. All nodes in the cluster are replicas of each other: if there are _n_ nodes, the replication factor is _n_.
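As a concrete illustration, a project can be moved between shards today by scheduling a repository storage move through the API. The project ID, token, host, and target storage name below are placeholders:

```shell
# Schedule a move of project 42 to the shard named "storage2".
# Requires an administrator access token; all values here are placeholders.
curl --request POST \
  --header "PRIVATE-TOKEN: <admin-token>" \
  --header "Content-Type: application/json" \
  --data '{"destination_storage_name": "storage2"}' \
  "https://gitlab.example.com/api/v4/projects/42/repository_storage_moves"
```

The move is asynchronous; its state can be polled from the same `repository_storage_moves` endpoint.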
## Direction
For most customers with a single class of storage (e.g. all Gitaly nodes in the fleet are identical), it should be possible to simply add more Gitaly nodes, and Gitaly will automatically rebalance within one cluster.
For the very largest customers, like GitLab.com, who need multiple classes of storage, each cluster can represent a class of storage. Distinct clusters could also be used for data isolation.
## Further details
### Approaches
**Praefect managed:** an elastic cluster, with the replication factor configured independently of the number of Gitaly nodes
Benefits:
- zero-downtime rebalancing
- one cluster will be enough for most customers
- incrementally grow cluster one node at a time
Cons:
- a layer of indirection: since a repository could be on any Gitaly node in a cluster, filesystem-based administration is more difficult. Tooling may need to be built for certain tasks.
**Rails managed:** automatic rebalancing, with the replication factor equal to the number of Gitaly nodes in the cluster
Benefits:
- fewer primitives: only have to worry about shards and balancing between them
Cons:
- downtime during migration
- large instances will require many shards
- scaling is more costly, since it happens in larger steps
### Challenges
- **Migration:** how does a multi-shard instance migrate to a few Gitaly Clusters?
  - It might be useful to have special tooling for combining Gitaly nodes into a cluster, and then pulling them apart if needed.
- **Administration:** what options does an administrator have if Gitaly Clusters is performing poorly? Automatic rebalancing sounds great when it works, but what if it stops working?
  - Noisy neighbor: distributed reads should help, assuming read load is the problem. If repositories are well mixed with a replication factor greater than 1, and routing is smart enough, this might help mitigate the problem.
  - Workaround: move the noisy repository to an isolated shard/cluster to prevent problems - this is the status quo.
- **Observability:** we'll need good monitoring to see whether Gitaly is rebalancing itself well.
## Proposal
- Simple Gitaly Cluster rebalancing - captured as a separate feature under https://gitlab.com/groups/gitlab-org/-/epics/5905
  - Replication factor of 1. Multiple Gitaly nodes. Repositories are created and moved automatically to balance disk utilization.
- Replication factor greater than one (configured per cluster)
- Varying replication factor within a cluster (maybe?)
  - For large, high-activity repositories like `gitlab-org/gitlab`, or a customer's primary repository, it might be beneficial to designate a higher replication factor. This might be automated, or manual.
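A sketch of what a per-repository replication factor could look like operationally, assuming a `set-replication-factor` Praefect sub-command; the paths, virtual storage name, and flags below are illustrative and may differ by version:

```shell
# Raise the replication factor of a single hot repository to 3.
# The config path, virtual storage name, and repository path are placeholders.
sudo /opt/gitlab/embedded/bin/praefect \
  -config /var/opt/gitlab/praefect/config.toml \
  set-replication-factor \
  -virtual-storage default \
  -repository gitlab-org/gitlab.git \
  -replication-factor 3
```

An automated variant could adjust this value based on observed read load per repository.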
## Links / references
epic