RFC: Gitaly HA Design
Now that requirements have been discussed in #1332 (closed), the design can be discussed further, along with the first implementation details. Before going into the system architecture, here are the trade-offs that follow from previous discussions:
Write duplication
One of the requirements is performance similar to or better than an NFS cluster with HA capabilities. That means that after the initial creation of replicas, the replication lag has to be minimal. Each write operation increases this lag between the primary and its secondaries. If the primary node goes down for whatever reason, the replicas have no way to obtain the missing data, and that data is lost.
At write time, each write operation needs to be duplicated to every shard holding a copy of the repository being written to. This requires a mechanism for separating read and write RPCs at runtime.
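To make the idea concrete, below is a rough Go sketch of such a separation: a hand-maintained registry marks each RPC method as read or write, and writes are fanned out to every node holding a copy. The method names, the `node` interface and the fan-out logic are illustrative assumptions, not existing Gitaly code.

```go
package coordinator

import (
	"fmt"
	"sync"
)

type accessType int

const (
	readRPC accessType = iota
	writeRPC
)

// Hand-maintained registry mapping RPC methods to their access type.
// Unknown methods default to readRPC here for brevity; a real
// implementation would have to fail loudly instead.
var rpcAccess = map[string]accessType{
	"/gitaly.RefService/FindAllBranchNames":    readRPC,
	"/gitaly.OperationService/UserCommitFiles": writeRPC,
}

// node stands in for a connection to a single Gitaly storage node.
type node interface {
	Invoke(method string, req []byte) error
}

// dispatch sends reads to the primary only, and duplicates writes to the
// primary and all secondaries holding a copy of the repository.
func dispatch(method string, req []byte, primary node, secondaries []node) error {
	if rpcAccess[method] == readRPC {
		return primary.Invoke(method, req)
	}

	targets := append([]node{primary}, secondaries...)
	errs := make(chan error, len(targets))
	var wg sync.WaitGroup
	for _, n := range targets {
		wg.Add(1)
		go func(n node) {
			defer wg.Done()
			errs <- n.Invoke(method, req)
		}(n)
	}
	wg.Wait()
	close(errs)

	for err := range errs {
		if err != nil {
			return fmt.Errorf("write duplication failed: %w", err)
		}
	}
	return nil
}
```

The important property is only that reads and writes are distinguishable per RPC; where the fan-out itself happens is discussed in the design section below.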
Incremental roll out
HA capabilities are requested by both small and large customers, maintaining both small and large instances. When the MVC of Gitaly HA is deployed to a large instance such as GitLab.com, it would create a large operational risk if the feature could not be rolled out incrementally. For example, it would be hard to predict the load created by the initial replication, or by the replication of writes.
Rolling out incrementally, for example at the shard level, allows for an initial experimentation phase with a contained set of repositories.
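As an illustration of how small the gate for such a roll out could be, here is a hypothetical per-shard allow-list; the names and the config shape are made up:

```go
package rollout

// Config is a hypothetical per-shard roll out gate: only the storages
// listed here take part in Gitaly HA, all others keep today's behaviour.
type Config struct {
	HAEnabledStorages []string
}

// HAEnabled reports whether replication and failover are active for a storage.
func (c Config) HAEnabled(storage string) bool {
	for _, s := range c.HAEnabledStorages {
		if s == storage {
			return true
		}
	}
	return false
}
```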
On another note, the MVC for Gitaly HA will most likely not include load balancing reads to replicas. Again, this limits the feature set and thus the risks involved. Load balancing would improve performance, which is not a core requirement. If customers request it, it should be considered at that time.
Process overview
The two most important processes of Gitaly HA are the initial replication and failover. These processes are described below to illustrate what capabilities are required.
Replication flow
The replication level, that is the desired number of replicas for each shard/repository, will not always match the actual number of replicas. This has to be detected, so another replica can be created. Creation will be done through an inter-Gitaly RPC, which is available at the time of writing. Once this is done, the replica needs to be verified, to guarantee it is in the same state as the primary copy. If it is not, the secondary should try to update its state until it is. Once it is, it should be marked as such and receive all subsequent write RPCs, so it remains up to date.
The crux of this flow lies in coordination: making sure that either only one process schedules the creation and destruction of replicas, or, if multiple do, that they coordinate with each other.
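A rough sketch of that reconciliation loop, assuming a single coordinator process runs it; the repository and replicator types below are hypothetical and only exist to show the detect, replicate, verify, mark-ready order:

```go
package coordinator

import (
	"context"
	"log"
	"time"
)

// Repository is a hypothetical view of the bookkeeping the coordinator needs.
type Repository struct {
	Path             string
	ReplicationLevel int      // desired number of replicas
	Replicas         []string // storages currently holding an up-to-date copy
}

// Replicator abstracts the inter-Gitaly RPCs; the method names are made up.
type Replicator interface {
	Replicate(ctx context.Context, repo Repository, target string) error      // copy from the primary
	Verify(ctx context.Context, repo Repository, target string) (bool, error) // compare against the primary
	MarkReady(ctx context.Context, repo Repository, target string) error      // start sending write RPCs
}

// Reconcile creates replicas on candidate storages until the desired
// replication level is reached.
func Reconcile(ctx context.Context, r Replicator, repo Repository, candidates []string) Repository {
	for _, target := range candidates {
		if len(repo.Replicas) >= repo.ReplicationLevel {
			break
		}
		if err := r.Replicate(ctx, repo, target); err != nil {
			log.Printf("replicate %s to %s: %v", repo.Path, target, err)
			continue
		}

		// Retry verification a few times; the secondary keeps updating
		// until it matches the primary's state.
		ready := false
		for attempt := 0; attempt < 5; attempt++ {
			ok, err := r.Verify(ctx, repo, target)
			if err != nil {
				log.Printf("verify %s on %s: %v", repo.Path, target, err)
				break
			}
			if ok {
				ready = true
				break
			}
			time.Sleep(time.Second)
		}
		if !ready {
			continue
		}

		if err := r.MarkReady(ctx, repo, target); err != nil {
			log.Printf("mark ready %s on %s: %v", repo.Path, target, err)
			continue
		}
		repo.Replicas = append(repo.Replicas, target)
	}
	return repo
}
```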
Fail over flow
As synchronization happens per shard or per repository, failover will have to follow the same pattern. The failover procedure will have to be trigger-based: long polling on the state of a single Gitaly, for example, could take a long time, and in the meantime read and write operations might still be routed to it.
Marking a shard as down could happen based on the number of write operations that fail in quick succession. Given that these operations exercise the system end to end, they are a good indication of health and liveness, whereas the health and liveness checks as implemented now say little to nothing about end-to-end system health. Still, those existing checks might be a possibility for a first iteration.
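As a sketch of that trigger, a small failure counter that marks a shard down after a number of write RPCs fail within a short window; the threshold, the window and all names here are assumptions for illustration:

```go
package coordinator

import (
	"sync"
	"time"
)

// healthTracker marks a shard as down after `threshold` write RPCs fail
// in quick succession (within `window` of each other).
type healthTracker struct {
	mu          sync.Mutex
	failures    int
	lastFailure time.Time
	threshold   int           // e.g. 3 consecutive failures
	window      time.Duration // failures must happen within this window
	down        bool
}

func newHealthTracker(threshold int, window time.Duration) *healthTracker {
	return &healthTracker{threshold: threshold, window: window}
}

// RecordWrite is called with the outcome of every write RPC against the
// shard. It returns true when the shard should be failed over.
func (h *healthTracker) RecordWrite(err error) bool {
	h.mu.Lock()
	defer h.mu.Unlock()

	if err == nil {
		h.failures = 0 // any successful write resets the streak
		return h.down
	}
	if time.Since(h.lastFailure) > h.window {
		h.failures = 0 // previous failure too old to count as "in quick succession"
	}
	h.failures++
	h.lastFailure = time.Now()
	if h.failures >= h.threshold {
		h.down = true
	}
	return h.down
}
```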
Self-repairing, or self-healing, properties are not part of the MVC and thus will not be discussed or defined in this issue.
Design
For storing state, an HA database is assumed to be present, and will be used.
To detect that new replications are required, and to decide when failover is needed, a coordinator is required. This coordinator could perform its duties in four distinct places in our architecture:
- Inside a Gitaly mesh
  - Each Gitaly can communicate with any other Gitaly, and the cluster uses consensus to determine which shards are healthy and which shards will store which replicas.
- Coordinator service, connected to each Gitaly
  - Requests go to any Gitaly, and each Gitaly redirects/proxies to the required peer Gitalys.
- Coordinator between the clients and the Gitalys
  - Each client connects to the coordinator, which proxies/redirects.
- Rails-managed coordination
  - Our Rails code maintains all Git storage details and schedules replication. Each operation first requires an internal API call to know which Gitaly to talk to.
To start with the mesh approach: creating and then maintaining a mesh would add considerable complexity. Today Gitaly is a simple component that performs no action unless an RPC is requested. This would no longer be the case with a mesh, which transfers stability risks to the component.
A coordinator behind the Gitalys would also increase the complexity of the Gitaly component, although the request-response cycle would be maintained. Routing, however, requires knowledge of which Gitaly is in what state, to prevent requests from going to shards that are not responding. Given that Gitaly HA should work on both Omnibus installs and orchestrators, not all services that Kubernetes provides can be relied upon.
Neither of those two options, a mesh network or a hidden coordinator, would I deem acceptable, because of the complexity they add to operating and routing, and because they require significant changes to Gitaly.
Rails is problematic both for the bookkeeping of when a failover is required, and because it would need to either poll for liveness and health checks or maintain knowledge of each RPC, effectively making it a proxy with a 60-second time limit due to Unicorn, with the performance penalty that may incur. Advantages include the existing ability to schedule jobs, and already having the knowledge of which data is stored on which shard. It would add complexity to an arguably already complex component.
The third option, basically the Gitaly Router approach, abstracts a Gitaly cluster away from all clients, while request routing stays a simple affair. It has additional benefits, like requiring next to no changes to Gitaly, and a free choice of technology, for example writing it in Go. Go has a stable gRPC implementation and outperforms Ruby, and that stability allows the advanced features of the framework to be used.
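To give an impression of how thin such a router could be, here is a minimal Go sketch that keeps a routing table of primaries per storage and dials the current primary on behalf of clients. The names and the static table are assumptions; a real implementation would also proxy the gRPC streams rather than just dial:

```go
package router

import (
	"fmt"
	"sync"

	"google.golang.org/grpc"
)

// Router maps each virtual storage name to the address of its current
// primary Gitaly node.
type Router struct {
	mu        sync.RWMutex
	primaries map[string]string
}

func New(primaries map[string]string) *Router {
	return &Router{primaries: primaries}
}

// Promote switches the primary for a storage, e.g. during failover.
func (r *Router) Promote(storage, newPrimary string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.primaries[storage] = newPrimary
}

// DialPrimary opens a connection to the current primary of a storage.
func (r *Router) DialPrimary(storage string) (*grpc.ClientConn, error) {
	r.mu.RLock()
	addr, ok := r.primaries[storage]
	r.mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown storage %q", storage)
	}
	// Plaintext dialing for the sketch only; real deployments would use TLS.
	return grpc.Dial(addr, grpc.WithInsecure())
}
```

The point of the sketch is the shape of the component: the routing table is the only state it needs, and failover becomes a single table update.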
Please let me know what you think, and where you disagree or what requires additional details.
/cc @jacobvosmaer-gitlab @andrewn @stanhu @nick.thomas @alejandro @jramsay @tommy.morgan