Active-active git replication
Many large customers have requested this and it is the obvious end state.
- We have Geo replication
- We'll soon have Geo DR
- We can then make the push to the Geo secondary transparent (the secondary relays it to the primary), so the user can push to and pull from the secondary
- The next step is active-active
How active-active works for the user:
- I can push to any GitLab server
- This server accepts my change locally
- If my change is accepted I'm sure it will not be overwritten by another push
- After a few minutes my change is available on all other servers
The state that needs to be synchronized between nodes to make this happen is the information that a git ref (or set of refs) has been updated. As long as the refs are updated in lock-step, the two git repositories on disk act as a single one in the face of concurrent reads or updates.
Git refs are, in effect, a mutable key-value store. There are several options for running a distributed K-V store across datacentres and for providing a means of locking it for safe updates, most of which center on the Raft consensus protocol. Ultimately, though, the refs need to be updated within the git repository itself, not just in a database copy of them.
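To make the "refs as a K-V store with safe updates" idea concrete, here is a minimal sketch of a compare-and-swap update on a ref. The RefStore class and its method names are hypothetical, and a local lock stands in for whatever cross-datacentre consensus layer would actually be chosen; this is not a proposed implementation.

```python
import threading


class RefStore:
    """Hypothetical in-memory stand-in for a replicated ref store."""

    def __init__(self):
        self._refs = {}
        # Assumption: a local lock plays the role of the distributed lock
        # or consensus round that a real multi-site store would need.
        self._lock = threading.Lock()

    def get(self, name):
        with self._lock:
            return self._refs.get(name)

    def compare_and_swap(self, name, expected_old, new):
        """Set `name` to `new` only if it currently equals `expected_old`."""
        with self._lock:
            if self._refs.get(name) != expected_old:
                return False  # someone else updated the ref first
            self._refs[name] = new
            return True


store = RefStore()
store.compare_and_swap("refs/heads/master", None, "aaa")

# Two sites both saw the ref at "aaa" and race to update it.
site_a_ok = store.compare_and_swap("refs/heads/master", "aaa", "bbb")
site_b_ok = store.compare_and_swap("refs/heads/master", "aaa", "ccc")
```

Only the first update wins; the second site has to reject the push (or retry against the new value). On each node the agreed change still has to be applied to the repository itself, for example with git update-ref, which accepts the expected old value as an optional last argument and refuses the update if it does not match.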
Once the refs are updated, the sites not receiving the initial git push are missing git objects necessary to service any subsequent git pull operations, but are able to handle git push (both with and without --force) correctly, preventing the nodes from going out of sync. They should retrieve those objects as quickly as possible from the site that has them, but this can happen without any synchronization concerns.
During the replication gap, before the out-of-date nodes have received the objects they are missing, there are three possible behaviours they can present to users who attempt to git pull:
- Reject the pull completely, or fail with an error message about a corrupted repository (default behaviour)
- Serve the pull with the old, out-of-date references
- Redirect the pull to the node that received the initial 'git push'
Of the three options, I prefer the second. We expect the replication gap to be short (on the order of seconds), and git users are accustomed to the idea that new changes might show up in the gap between git pull and git push.
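The three behaviours listed above can be sketched as a small policy switch. Everything here is an assumption made for illustration: the GapPolicy enum, the handle_pull function, and the idea that each node keeps both the agreed refs and the last set of refs for which it holds all objects.

```python
from enum import Enum


class GapPolicy(Enum):
    REJECT = "reject"            # option 1: fail the pull
    SERVE_STALE = "serve_stale"  # option 2: answer with the old refs
    REDIRECT = "redirect"        # option 3: send the user to the origin node


def handle_pull(policy, agreed_refs, local_refs, have_objects, origin_node):
    """Decide how a node answers a pull during the replication gap.

    agreed_refs:  ref name -> object id, as agreed across all nodes
    local_refs:   ref name -> object id, last state fully present locally
    have_objects: set of object ids this node currently holds
    """
    missing = [ref for ref, oid in agreed_refs.items() if oid not in have_objects]
    if not missing:
        # No gap: the node has caught up and can serve the agreed refs.
        return ("serve", agreed_refs)
    if policy is GapPolicy.REJECT:
        return ("error", sorted(missing))
    if policy is GapPolicy.SERVE_STALE:
        return ("serve", local_refs)
    return ("redirect", origin_node)


agreed = {"refs/heads/master": "bbb"}
stale = {"refs/heads/master": "aaa"}
result = handle_pull(GapPolicy.SERVE_STALE, agreed, stale, {"aaa"}, "node-a")
```

With the second option the user simply sees the repository as it was a few seconds earlier, and a later pull picks up the new commit once the objects have arrived.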
Git Ketch approach (decided against in the subsequent discussion)
Using Git Ketch clusters + sharding for this would allow us to solve multiple problems:
- We no longer need CephFS to scale GitLab.com (CephFS is not production ready and requires deep C expertise); we might still run Ceph for object storage
- We no longer need one giant volume (which is experimental in Ceph and hard to back up)
- The Git Ketch servers can be simple (local HDD and SSD running ZFS)
- There is no single point of failure in the filesystem (no Ceph monitoring nodes)
- File system corruption is very unlikely; we just have to be careful with the rebalancing algorithm
- The read latency should be small (load is distributed over multiple servers, no network coordination required)
- We already know we can shard among multiple file servers; we're doing that today (although we must go from file system mounting to RPC)
- We are in control of the rebalancing with repo granularity, getting all the advantages of http://githubengineering.com/introducing-dgit/
- Customers can run this for active active without sharding (or with it), giving us the active-active our largest users need
- GitLab Geo can switch to this setup to have local push + pull
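As a rough illustration of what "rebalancing with repo granularity" could mean, the sketch below keeps a placement table from repository path to the file servers holding a replica, and moves a single replica of a single repository. The table layout, the server names, and the move_replica helper are all hypothetical; a real rebalancer would also have to copy the data and confirm the new replica is current before dropping the old one.

```python
def move_replica(placement, repo, src, dst):
    """Return a new placement with one replica of `repo` moved from src to dst.

    placement: dict mapping repository path -> list of file servers
    """
    servers = placement[repo]
    if src not in servers:
        raise ValueError(f"{src} does not hold a replica of {repo}")
    if dst in servers:
        raise ValueError(f"{dst} already holds a replica of {repo}")
    updated = dict(placement)  # leave the caller's table untouched
    updated[repo] = [dst if server == src else server for server in servers]
    return updated


placement = {
    "group/project-a.git": ["fs1", "fs2", "fs3"],
    "group/project-b.git": ["fs1", "fs2", "fs3"],
}

# Move only project-a's third replica to a new server; project-b is unaffected.
rebalanced = move_replica(placement, "group/project-a.git", "fs3", "fs4")
```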