Full project mirroring between GitLab instances

Description

GitLab has ~Geo, a product for multi-region replication of GitLab data: all database contents, plus files, repositories and wikis.

Geo has a "selective sync" feature, which is used to replicate a subset of an instance elsewhere.

GitLab is gaining a "bidirectional sync" feature: https://gitlab.com/gitlab-org/gitlab-ee/issues/3745. It can be used as a sort of poor man's multi-region, multi-master replication of a subset of repositories between two non-Geo GitLab instances, but files and database contents (issues, MRs, memberships, etc.) aren't part of it.

Proposal

Enhance bidirectional replication with instance, namespace and project-level federation of database contents and files. We could start by supporting it only at the project level, though.

An admin or owner on gitlab-a.com would also have an account on gitlab-b.com. They would set up an instance, group or project-level integration on the latter, using a personal access token from the former.

Whenever a change happens on one instance, it is replicated asynchronously to the other, using webhooks to notify that a change has happened. Obviously, conflicts can occur, as we see with bidirectional repository mirroring. We may need an explicit federation object on both sides to support read-write on both sides; if set up on only one side, it could act as a read-only replica.
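To make the notification flow a bit more concrete, here is a minimal sketch of the receiving side. It assumes the webhook carries only a pointer to what changed, and that the receiver queues a pull for later instead of applying anything inline. `ChangeNotice`, `ReplicationQueue` and the field names are hypothetical, not an existing GitLab API. Dropping notices that originated locally is my own assumption about how to stop changes echoing back and forth between two read-write sides.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterator, Set


@dataclass(frozen=True)
class ChangeNotice:
    """Hypothetical webhook payload: a pointer to the change, not the change itself."""
    source_instance: str  # e.g. "gitlab-a.com"
    object_type: str      # e.g. "issue", "merge_request", "upload"
    object_id: int
    version: int          # assumed monotonic per object on the source side


class ReplicationQueue:
    """In-memory stand-in for the async job queue on the receiving instance."""

    def __init__(self, local_instance: str) -> None:
        self.local_instance = local_instance
        self._pending: Deque[ChangeNotice] = deque()
        self._seen: Set[ChangeNotice] = set()

    def notify(self, notice: ChangeNotice) -> bool:
        """Record a notice; return True if a pull was queued for it."""
        # Ignore our own changes coming back from the other side.
        if notice.source_instance == self.local_instance:
            return False
        # Webhooks can be redelivered, so drop exact duplicates.
        if notice in self._seen:
            return False
        self._seen.add(notice)
        self._pending.append(notice)
        return True

    def drain(self) -> Iterator[ChangeNotice]:
        """Yield queued notices in arrival order; a worker would pull each one."""
        while self._pending:
            yield self._pending.popleft()
```

A read-only replica would only ever call `notify` and `drain`; a read-write member would also send notices of its own to the other side.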

A major source of conflicts in the multi-master version would be IIDs of issues, MRs, etc. This can be worked around using the same hack as MySQL multi-master replication with N members: a fixed offset per member and an increment of N. With 2 members in the federation, the first uses only odd IIDs and the second only even IIDs.
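The offset scheme is easy to sketch. The function names and the 1-based member numbering below are assumptions for illustration, not existing GitLab code: member k of N allocates k, k + N, k + 2N, and so on, so no two members can ever hand out the same IID.

```python
from typing import Optional


def next_iid(last_local_iid: Optional[int], member_index: int, member_count: int) -> int:
    """Return the next IID this federation member may allocate.

    member_index is 1-based. last_local_iid is the highest IID this member
    has allocated itself (None if it has allocated none yet); IIDs that
    arrived from other members are not counted.
    """
    if not 1 <= member_index <= member_count:
        raise ValueError("member_index must be in 1..member_count")
    if last_local_iid is None:
        return member_index
    return last_local_iid + member_count


def owner_of(iid: int, member_count: int) -> int:
    """Return the 1-based index of the member that allocated this IID."""
    return (iid - 1) % member_count + 1
```

With two members, the first gets 1, 3, 5, ... and the second gets 2, 4, 6, .... The obvious cost is gaps in the IID sequence on each side, and adding a member to the federation later would need a new scheme for IIDs allocated from then on.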

Artifacts and pipelines are more difficult. We might just have to disable CI on all but one node to begin with.

File conflicts won't happen, as we add random hex to every upload. We'd need to tell the other nodes to pull each file as it is uploaded, though.
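As a rough illustration of why uploads don't collide, and what the "please pull this file" notice could look like: the path layout, the 16-byte secret and the payload fields below are all assumptions for the sketch, not the real upload code.

```python
import secrets
from typing import Dict


def upload_path(project_id: int, filename: str) -> str:
    """Build a storage path with a random hex segment (hypothetical layout).

    Two instances uploading the same filename to the same project still get
    different paths, so neither side overwrites the other's file.
    """
    return f"uploads/{project_id}/{secrets.token_hex(16)}/{filename}"


def pull_notice(source_instance: str, path: str) -> Dict[str, str]:
    """Hypothetical payload telling the other nodes to fetch a new upload."""
    return {"event": "upload_created", "source": source_instance, "path": path}
```

Since the path is unique, the receiving side can treat the pull as idempotent: if it already has a file at that path, there is nothing to do.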

Repository conflicts are being handled orthogonally. We can apply the same logic to both the main and wiki repositories.

Memberships could be left out-of-scope to begin with, but we could consider automatic linking by email address or a fixed map of user equivalences between instances/groups/projects too.
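If we did go for automatic linking, the lookup could be as simple as the sketch below. The function name, the precedence (fixed map first, then email) and the case-insensitive email comparison are all assumptions on my part.

```python
from typing import Dict, Optional


def resolve_remote_user(
    local_username: str,
    local_email: str,
    fixed_map: Dict[str, str],
    remote_usernames_by_email: Dict[str, str],
) -> Optional[str]:
    """Find the remote username equivalent to a local user, if any.

    fixed_map maps local usernames to remote usernames and wins over email
    matching. remote_usernames_by_email is assumed to be keyed by
    lowercased email address.
    """
    if local_username in fixed_map:
        return fixed_map[local_username]
    return remote_usernames_by_email.get(local_email.strip().lower())
```

Users with no match return `None`; what to do with their issues and comments (a ghost user, skipping them, or something else) is still an open question.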

What else?

This feature proposal represents a "less-trust" form of Geo selective sync. It's something you can set up between two independent GitLab instances. Both sides would be read-write, and it could be set up entirely in the GitLab UI, with no need for sysadmin work or PostgreSQL replication on the respective instances. Since only an instance, namespace or project admin can set this up, I don't think there are permissions problems to worry about.

Links / references

/cc @jramsay

Edited Sep 27, 2018 by Nick Thomas