🌏 Gitaly Cluster (#1489) · Epics · GitLab.org


*This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.*

## Vision

Running Gitaly in a High Availability configuration should be straightforward and should not meaningfully reduce the performance of Gitaly or GitLab. When Gitaly is run in a High Availability configuration, it should be quick and easy to fail over in the event of a repository failure, server failure, or outage, and to achieve Gitaly uptime of 99.999% (five 9's).

We should be iterating towards a final solution that customers are confident is **better than NFS in performance, reliability, and administration.**

## :rocket: Status

- :earth_africa: Eventual consistency (async replication) [available in GitLab 13.0](https://about.gitlab.com/releases/2020/05/22/gitlab-13-0-released/#gitaly-cluster-for-high-availability-git-storage) (https://gitlab.com/groups/gitlab-org/-/epics/842)
- :herb: Distribute reads across replicas [available in GitLab 13.3](#) (https://gitlab.com/groups/gitlab-org/-/epics/2013)
- :handshake: Strong consistency https://gitlab.com/groups/gitlab-org/-/epics/1189
  - Strong consistency is complete for Gitaly Cluster. Additional improvements are being made to further improve the experience for our users.
- 🪁 Variable replication factor https://gitlab.com/groups/gitlab-org/-/epics/3372

## In progress/Next up

- [ ] :unicorn: Improved availability and performance https://gitlab.com/groups/gitlab-org/-/epics/2983

## Key Requirements

Immediate objectives:

- **Works for Omnibus and Kubernetes.** Most of our customers use Omnibus, but the lack of Gitaly HA is obstructing Kubernetes adoption. We shouldn't make choices that isolate either.
- **Works on GitLab.com from day one.** It should be possible to validate and test our progress on GitLab.com as we go, so that we get feedback from SREs and understand the performance characteristics of what is being built.
- **Automatic Failover** is mandatory. Without automatic failover, it is not possible to achieve [99.95](https://uptime.is/99.95) monthly uptime should an event occur. Most customers are aiming for [four nines](https://uptime.is/99.99), if not [five nines](https://uptime.is/99.999).
  - For lower SLAs it would be possible to trigger a failover using Geo. This is the approach taken by GitHub.
- **Predictable data loss** is mandatory. Automatically failing over on an eventually consistent system will guarantee the loss of any unreplicated data. If a failover occurs, it should be known which projects lost data.
- **Orthogonal to existing features**
  - **Git object deduplication**
  - **Geo compatibility** is mandatory. Large customers investing in high availability are also investing in disaster recovery solutions. They are different tools for different problems, but for organizations treating GitLab as a critical service, both need to work together.

Direction (we will iterate towards this):

- **Transactional Commits** are complex and difficult, but we are iterating toward them. When GitLab is running in a highly available configuration, if a commit is pushed to GitLab successfully, we should be confident that an immediate single node failure will not result in data loss.
  - This is similar to GitHub Spokes, and is referenced by customers as their objective. GitHub have announced that they are working to bring High Availability to GitHub Enterprise, and it is most likely to be derived from their existing HA implementation.

## Out of scope

- **Zero-downtime maintenance features** are not planned. It should be possible to put the whole instance into maintenance mode to prevent writes, allow replicas to catch up, terminate the primary, and then disable maintenance mode. This may be reconsidered in the future.
- **Disaster Recovery** is different from High Availability. Where HA would cover one server having a power outage, DR would cover the whole data center having a power outage. See https://docs.gitlab.com/ee/administration/geo/replication/index.html

## Customers

- https://gitlab.my.salesforce.com/00161000013aRjG ~"customer+"
- https://gitlab.my.salesforce.com/00161000004yLEy ~"customer+"
- https://gitlab.my.salesforce.com/00161000004bZPD ~"customer+"
- https://gitlab.my.salesforce.com/0016100001CXro6 ~customer
- https://gitlab.my.salesforce.com/0016100000NmU19 ~customer
- https://gitlab.my.salesforce.com/00161000004zrF8 ~customer
- https://gitlab.my.salesforce.com/00161000004bZxf ~customer
- https://gitlab.my.salesforce.com/00161000003QssT ~customer
- https://gitlab.my.salesforce.com/00161000004yxj9 ~customer
- https://gitlab.my.salesforce.com/0016100001F4xsr ~customer
- https://gitlab.my.salesforce.com/0016100001ZRVuq ~customer
- https://gitlab.my.salesforce.com/0016100000dlBPy ~customer
- https://gitlab.my.salesforce.com/00161000002wZ7Z

## Links / references
