# HA Gitaly (MVC)
## Problem to solve
Availability and redundancy of Git data is critically important for GitLab instances: users must always be able to access their data, and data loss must be prevented in the event of a node failure. In a cloud native configuration it is only possible to scale a Gitaly node vertically.
Through object storage and database replication, all other data stores in GitLab have this protection, but Gitaly nodes do not.
## Progress
- [x] 12.4 (Oct 2019) - Minimal [Praefect](https://docs.gitlab.com/ee/administration/gitaly/praefect.html) configuration in Staging gitlab-com/gl-infra/production#1223
- [x] 12.6 (Dec 2019) - Minimal [Praefect](https://docs.gitlab.com/ee/administration/gitaly/praefect.html) configuration in Production gitlab-com/gl-infra/production#1473
- [x] 12.7 (Jan 2020) - Observe [Praefect](https://docs.gitlab.com/ee/administration/gitaly/praefect.html) replication in Production https://gitlab.com/gitlab-com/gl-infra/production/issues/1605
- [x] 12.8 (Feb 2020) - Demonstrate automatic failover
- [x] 12.9 (Mar 2020) - Alpha https://gitlab.com/groups/gitlab-org/-/epics/2659 https://about.gitlab.com/releases/2020/03/22/gitlab-12-9-released/#high-availability-for-gitaly-alpha
- [x] 12.10 (Apr 2020) - Beta https://gitlab.com/groups/gitlab-org/-/epics/2657 https://about.gitlab.com/releases/2020/04/22/gitlab-12-10-released/#high-availability-for-gitaly-beta
- [x] 13.0 (May 2020) - Generally available https://gitlab.com/groups/gitlab-org/-/epics/2658
:exclamation: :handshake: We are beginning to investigate **strong consistency** https://gitlab.com/groups/gitlab-org/-/epics/1189, which will be the immediate priority following this eventual consistency MVC.
## Further details
Git storage at GitLab now goes fully through Gitaly, a server that provides a Git-implementation-agnostic interface to all our Git data. For this service, availability is tracked based on RPC responses. However, if a server has a hard disk failure, or clients are unable to connect to a certain Gitaly storage due to a network partition, this SLA drops by 1/n, where n is the number of storage shards in use.
This is because the current implementation of Gitaly lacks logic to maintain one or more replicas of Git repositories. And even if replicas existed, there is no failover logic at this time.
To bring it back to the analogy [introduced by Gavin McCance](https://image.slidesharecdn.com/cerndatacentreevolution-sdcd2012-121119074533-phpapp02/95/cern-data-centre-evolution-17-1024.jpg?cb=1427086216), each Gitaly server right now is a pet, where it would be beneficial to have them be cattle. This means that GitLab, as a whole, should remain available if any of the storage nodes fails, for whatever reason, thus achieving high availability (further referred to as HA). This is done by replicating the data and maintaining these replicas so their data is always current.
## Proposal
As a GitLab administrator, using an existing GitLab instance, I should be able to:
- create a new [Praefect cluster](https://docs.gitlab.com/ee/administration/gitaly/praefect.html)
- migrate individual repositories from their current Gitaly node to the Praefect cluster using the storage API
- when repositories stored within the Praefect cluster are created, modified, or deleted, on the primary node in the Praefect cluster, this should be replicated to the replica nodes in the cluster
- if the primary node in the cluster becomes inaccessible, Praefect should fail over to a replica and make it the primary
<details><summary><strong>Replication</strong></summary>
There will be a configurable target for the number of replicas for each repository. The number of replicas may not match the target replication level (e.g. initial state with no replication, node failure requiring new replicas to be created). This will have to be detected, so new replicas can be created.
Creation will be done through inter-Gitaly RPCs, which are available at the time of writing. Once a replica is created, it needs to be verified to guarantee it is in the same state as the primary copy. If it is not, the replica should keep updating its state until it is. Once verified, it should be marked as such and receive all subsequent write RPCs so it remains up to date.
The crux of this flow lies in coordination: ensuring that either only one process schedules the creation and destruction of replicas, or that multiple processes do so with coordination between them.
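The detection step above can be sketched as follows. This is an illustrative sketch only, not Praefect's actual implementation: the `Repository`, `outdatedReplicas`, and `missingReplicas` names are hypothetical, and checksums stand in for the real repository-state verification.

```go
package main

import "fmt"

// Repository tracks the primary's checksum and the checksums of its replicas.
type Repository struct {
	Name     string
	Primary  string            // checksum of the authoritative copy
	Replicas map[string]string // node name -> checksum of its copy
}

// outdatedReplicas returns the nodes whose copy does not match the primary,
// i.e. the replicas a coordinator would schedule replication jobs for.
func outdatedReplicas(r Repository) []string {
	var out []string
	for node, sum := range r.Replicas {
		if sum != r.Primary {
			out = append(out, node)
		}
	}
	return out
}

// missingReplicas reports how many new replicas must be created to reach
// the configured replication target.
func missingReplicas(r Repository, target int) int {
	if len(r.Replicas) >= target {
		return 0
	}
	return target - len(r.Replicas)
}

func main() {
	repo := Repository{
		Name:    "gitlab-org/gitaly",
		Primary: "abc123",
		Replicas: map[string]string{
			"gitaly-2": "abc123",
			"gitaly-3": "def456", // stale: must be re-replicated and verified
		},
	}
	fmt.Println(outdatedReplicas(repo))   // [gitaly-3]
	fmt.Println(missingReplicas(repo, 3)) // 1
}
```

A coordinator would run this comparison periodically and enqueue a replication job for each outdated or missing replica, re-verifying after each job completes.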
</details>
<details><summary><strong>Fail over</strong></summary>
As synchronization happens per shard or per repository, failover will have to follow the same pattern. The failover procedure will have to be trigger based: long polling on the state of a single Gitaly node could take a long time, while read and write operations might still be occurring in the meantime.
Marking a shard as down could happen based on the number of write operations that fail in quick succession. Because these operations exercise the system end to end, they are a good indication of health and liveness, whereas the health and liveness checks as implemented now say little to nothing about end-to-end system health. This might nevertheless be a possibility for a first iteration.
Self-repairing (self-healing) properties are not part of the MVC and will therefore not be discussed or defined in this issue.
</details>
## Design
We presume GitLab's SQL database is available. When SQL is not available, reads and writes from/to HA Gitaly will fail.
A coordinator is required to detect when new replications are needed and when failover must happen.
The preferred approach is a proxy/router in Golang that sits between the clients and Gitaly, because it requires few changes to Gitaly, has a stable gRPC implementation, and has better performance characteristics than Ruby. See RFC https://gitlab.com/gitlab-org/gitaly/issues/1335 for discussion of other approaches.
Introduce a **reverse proxy** called _Praefect_ to manage replication and failover between a cluster of Gitaly nodes, called a _Gitaly cluster_. The reverse proxy will use a Postgres database to track state so that Praefect is stateless. Each Gitaly cluster will appear as a simple Gitaly storage location to the GitLab application.
It should be possible for Praefect to be scaled horizontally, so that any Praefect node can service a request to any Gitaly cluster.
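The horizontal-scaling property follows from keeping all routing state in the shared database. A sketch of that lookup, where an in-memory map stands in for the Postgres-backed state and the `clusterState` and `routeWrite` names are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// clusterState is the shared state every Praefect node reads; in the real
// design this lives in Postgres so that Praefect itself stays stateless.
type clusterState struct {
	primaries map[string]string // virtual storage name -> current primary Gitaly node
}

// routeWrite returns the Gitaly node that must serve a write for the given
// virtual storage. Because the lookup hits shared state rather than local
// memory, any Praefect node can answer for any Gitaly cluster.
func (s *clusterState) routeWrite(virtualStorage string) (string, error) {
	primary, ok := s.primaries[virtualStorage]
	if !ok {
		return "", errors.New("unknown virtual storage: " + virtualStorage)
	}
	return primary, nil
}

func main() {
	state := &clusterState{primaries: map[string]string{
		"default": "gitaly-1.internal:8075",
	}}
	node, err := state.routeWrite("default")
	if err != nil {
		panic(err)
	}
	fmt.Println("proxying write to", node) // proxying write to gitaly-1.internal:8075
}
```

After a failover, only the database row for the affected virtual storage changes; every Praefect node picks up the new primary on its next lookup, with no per-node state to reconcile.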
_Architecture diagrams for the Alpha and Beta/GA configurations (images not included)._
<details><summary>Old Mermaid diagram</summary>
```mermaid
graph TB
GitLab-Rails --> Praefect-1;
subgraph Git-Tier
Praefect-1 --> Praefect-Database;
Praefect-1 --> Praefect-1-Gitaly-1;
Praefect-1 --> Praefect-1-Gitaly-2;
Praefect-1 --> Praefect-1-Gitaly-3;
subgraph Gitaly-Cluster
Praefect-1-Gitaly-1;
Praefect-1-Gitaly-2;
Praefect-1-Gitaly-3;
end
end
style Git-Tier fill:#FFF,stroke-dasharray:5
```
</details>
## Links / references
- https://gitlab.com/gitlab-com/Product/issues/502
- https://gitlab.com/gitlab-com/Product/issues/501