Active-active Git replication

Many large customers have requested this and it is the obvious end state.

Current situation:

We have Geo replication
We'll soon have Geo DR
We can then make the push to the Geo secondary transparent (secondary relays it to the primary), so the user can push and pull from the secondary
The next step is active active

How active active works for the user:

I can push to any GitLab server
This server accepts my change locally
If my change is accepted I'm sure it will not be overwritten by another push
After a few minutes my change is available on all other servers

The state that needs to be synchronized between nodes to make this happen is the information that a git ref (or set of refs) has been updated. As long as the refs are updated in lock-step, the two git repositories on disk act as a single one in the face of concurrent reads or updates.

Git refs are a mutable key-value store. There are several options for running a distributed K-V store across datacentres, and for providing a means of locking it for safe updates, most of which center on the Raft protocol. Ultimately, though, the references need to be updated within the git repository itself, not just in a database copy of them.
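To make the "lock-step" requirement concrete, here is a minimal sketch of a compare-and-swap ref update. The `RefStore` class and its method names are hypothetical, and the in-process lock only stands in for whatever cross-site agreement (for example a Raft-backed K-V store) would really be used; it is not a proposal for the actual implementation.

```python
import threading
from typing import Dict, Optional


class RefStore:
    """In-memory stand-in for a replicated ref store (hypothetical)."""

    def __init__(self) -> None:
        self._refs: Dict[str, str] = {}
        # Assumption: this lock stands in for cross-site consensus.
        self._lock = threading.Lock()

    def get(self, name: str) -> Optional[str]:
        with self._lock:
            return self._refs.get(name)

    def compare_and_swap(self, name: str, expected: Optional[str], new: str) -> bool:
        """Point `name` at `new` only if it currently points at `expected`."""
        with self._lock:
            if self._refs.get(name) != expected:
                return False  # another push got there first; reject this one
            self._refs[name] = new
            return True


store = RefStore()
store.compare_and_swap("refs/heads/main", None, "aaa")

# Two sites both saw the branch at "aaa" and race to update it.
site_a_ok = store.compare_and_swap("refs/heads/main", "aaa", "bbb")
site_b_ok = store.compare_and_swap("refs/heads/main", "aaa", "ccc")

print(site_a_ok, site_b_ok, store.get("refs/heads/main"))  # True False bbb
```

The losing site would reject its push, which is what gives the user the "if my change is accepted it will not be overwritten" guarantee. Writing the winning value into the on-disk repository on every site is a separate step not shown here.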

Once the refs are updated, the sites not receiving the initial git push are missing git objects necessary to service any subsequent git pull operations, but are able to handle git push (both with and without --force) correctly, preventing the nodes from going out of sync. They should retrieve those objects as quickly as possible from the site that has them, but this can happen without any synchronization concerns.

During the replication gap, before the out-of-date nodes have received the objects they are missing, there are three possible behaviours they can present to users who attempt to git pull:

Reject the pull completely, or fail with an error message about a corrupted repository (default behaviour)
Serve the pull with the old, out-of-date references
Redirect the pull to the node that received the initial 'git push'

Of the three options, I prefer the second. We expect the replication gap to be short (on the order of seconds), and git users are accustomed to the idea that new changes might show up in the gap between git pull and git push.
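A rough sketch of how the second option might look on an out-of-date node. All names here are hypothetical, and it assumes, as a simplification, that having the tip commit locally means every object reachable from it has arrived; ref deletions are ignored.

```python
from typing import Dict, Set


def refs_to_advertise(
    agreed: Dict[str, str],    # refs as agreed across all sites
    served: Dict[str, str],    # refs this node last served with complete objects
    local_objects: Set[str],   # object ids present on this node
) -> Dict[str, str]:
    """Pick the ref values to advertise to a client doing a git pull."""
    result: Dict[str, str] = {}
    for name, tip in agreed.items():
        if tip in local_objects:
            result[name] = tip            # objects have arrived: serve the new tip
        elif name in served:
            result[name] = served[name]   # still in the gap: serve the old tip
        # A brand-new ref whose objects are missing is not advertised yet.
    return result


agreed = {"refs/heads/main": "bbb", "refs/heads/new": "ddd"}
served = {"refs/heads/main": "aaa"}
local_objects = {"aaa"}

during_gap = refs_to_advertise(agreed, served, local_objects)

local_objects.update({"bbb", "ddd"})  # background fetch has completed
after_gap = refs_to_advertise(agreed, served, local_objects)

print(during_gap)
print(after_gap)
```

Pushes would still be checked against the agreed refs rather than the advertised ones, so serving stale refs on pull does not let the nodes drift out of sync.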

Git Ketch approach (decided against in the subsequent discussion)

Using Git Ketch clusters + sharding for this would allow us to solve multiple problems:

We no longer need CephFS (which is not production ready and requires deep C expertise) to scale GitLab.com, although we might still run Ceph for object storage
We no longer need one giant volume (which is experimental in Ceph and hard to back up)
The Git Ketch servers can be simple (local HDD and SSD running ZFS)
There is no single point of failure in the filesystem (no Ceph monitoring nodes)
File system corruption is very unlikely; we just have to be careful with the rebalancing algorithm
The read latency should be small (load is distributed over multiple servers, no network coordination required)
We already know we can shard among multiple file servers, since we're doing that today (although we must go from file system mounting to RPC)
We are in control of the rebalancing with repo granularity, getting all the advantages of http://githubengineering.com/introducing-dgit/
Customers can run this for active active without sharding (or with it), giving us the active-active our largest users need
GitLab Geo can switch to this setup to have local push + pull

/cc @stanhu @jacobvosmaer-gitlab @Haydn

Edited Apr 17, 2022 by Chris Kaburu