SEG for Git efficiency
Problem statement
Git data is one of our core stateful services at GitLab. Because the Git service is so latency-sensitive, the data is currently stored on SSDs attached to dedicated Gitaly compute nodes. To make this resilient, we offer Gitaly Cluster, which runs three Gitaly nodes behind a load-balancing server called Praefect, which in turn relies on its own tracking database. This architecture has been driven primarily by the nature of Git, by our desire to provide a performant and resilient service without a single point of failure (SPOF), and by the need to handle repositories with high read loads.
As a result, each Gitaly Cluster is a complex service with multiple pets (3 Gitaly nodes, an additional database cluster, and 2 Praefect nodes).
This leads to a few challenges:
- Complexity for GitLab administrators in managing a cluster of stateful services for git data with multiple sources of truth
- Significant engineering undertaking to develop and scale
- Expensive cost profile: each bit of Git data is stored 3 times, and 6 times with Geo DR. This can push the cost of Git data above $2/GB.
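To make the replication arithmetic concrete, here is a minimal sketch. Only the copy counts (3 per cluster, 6 with Geo DR) come from the text above. The per-copy price and the function name are hypothetical placeholders chosen for illustration, not GitLab figures.

```python
def effective_cost_per_gb(price_per_copy_gb: float, copies: int) -> float:
    """Cost of one logical GB when every byte is kept `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return price_per_copy_gb * copies


# ASSUMPTION: illustrative price for one GB of SSD-backed storage with
# dedicated compute. This number is made up for the example.
ASSUMED_PRICE_PER_COPY_GB = 0.35

cluster_cost = effective_cost_per_gb(ASSUMED_PRICE_PER_COPY_GB, copies=3)
geo_cost = effective_cost_per_gb(ASSUMED_PRICE_PER_COPY_GB, copies=6)
```

Under this assumed price, the Geo DR case lands above $2 per logical GB. The point is that the multiplier, not the unit price, dominates: halving the number of full copies saves as much as halving the storage price.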
Opportunity
While there are very good reasons for developing a service like Gitaly Cluster, and alternatives have been discussed at length (object storage: gitlab-org&479), it may make sense to make a small bet to see if one of these could pay off.
This issue proposes using a Single Engineer Group (SEG), the incubation model, to make that small bet and see whether we can find a more efficient method for storing Git data. The benefits could be:
- Reduced cost profile as Object Storage is significantly cheaper
- Potentially simpler architecture, as some git nodes could be cattle instead of pets
- Potentially less administrative burden for GitLab administrators
The engineering complexity, however, may or may not be any lower.
Proposed success criteria
- $/GB < 0.50 (Including DR)
- Single state store (versus DB+Disk, should also reduce operational complexity)
- Horizontally scalable
- Performance of committing/cloning to www-gitlab-com at least as fast as .com today
Potential options
There are two main problem areas, cost and operational complexity. Some examples of potential avenues to explore are below.
Options for cost reduction:
- Extending Git to better handle object storage directly (maybe a more expansive version of gitlab-org&1487 (closed))
- Stateless caching layer in front of Object Storage
- Utilizing tiered storage, where perhaps inodes (which account for most Git accesses) could reside on local disk, with the actual content in Object Storage. This could be a variation / end-result of
Options for reduction in complexity:
- Use a different consensus mechanism than the shared and stateful Praefect DB (e.g. Raft or Paxos, although using read replicas may then require 2PC/strong consistency)
- Stateless caching layer in front of Object Storage
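The "stateless caching layer in front of Object Storage" idea can be sketched as a read-through cache. This is a minimal illustration under stated assumptions, not a proposed design: `InMemoryObjectStore` and `ReadThroughCache` are hypothetical names, the store is an in-memory stand-in for a real bucket, and only immutable, content-addressed objects are covered (mutable data such as refs would need separate handling).

```python
import hashlib
from collections import OrderedDict
from typing import Dict


class InMemoryObjectStore:
    """Stand-in for a real object storage bucket (hypothetical)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.reads = 0  # counts round trips to the backing store

    def put(self, data: bytes) -> str:
        # Content-addressed key, similar in spirit to how Git names objects.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        self.reads += 1
        return self._blobs[key]


class ReadThroughCache:
    """Bounded LRU cache that holds no authoritative state.

    Because keys are content hashes, a cached entry never goes stale,
    so the cache can be discarded and rebuilt at any time.
    """

    def __init__(self, store: InMemoryObjectStore, capacity: int = 1024) -> None:
        self._store = store
        self._capacity = capacity
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> bytes:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        data = self._store.get(key)  # cache miss: fetch from object storage
        self._entries[key] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return data


store = InMemoryObjectStore()
key = store.put(b"blob contents")
cache = ReadThroughCache(store, capacity=2)
first = cache.get(key)   # miss: goes to the store
second = cache.get(key)  # hit: served from the cache
```

Since the only source of truth is the object store, a cache node in this sketch behaves like cattle rather than a pet: replacing it loses nothing except warm entries, which is why this option appears under both cost and complexity reduction.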