META: EXPLORATION: dismantle Geo into HA features
I was going to create this as an Epic, but this is going to need some discussion and https://gitlab.com/gitlab-org/gitlab-ee/issues/3889 . Note that it's still very rough, and we may not actually want to do it. I'm exploring the idea, rather than proposing the idea, at the moment.
Currently, a Geo secondary is considered to be a separate installation of GitLab, in a separate datacentre, using a different DNS name and storage backend to the primary. ~"Geo DR" is the process of converting one of those secondaries into a primary when the original primary is removed.
We're working on a number of features to reduce the differences between Geo primaries and secondaries. Eventually, we expect to be able to git push to a secondary and have that work transparently. We expect ~"Geo DR" to be quick and easy. We may have a single repmgr configuration for all postgres servers in the cluster.
We also want to be able to replicate storage within a single node, to have multiple independent replicas within a single datacentre.
I think HA and Geo are solving lots of the same problems, and we have duplication between the two. What happens if we try to re-imagine Geo as a set of HA features with cross-DC / multi-region focus? E.g., we treat all the machines that make up all the Geo nodes as belonging to a single entity, that happens to be in more than one datacentre?
Users should connect to the Geo node closest to them
This is the biggest difference between a HA and a Geo setup, at least from a user's point of view. In HA, there's a single domain that points to a load-balancer or multiple machines, expected to all be in the same DC, that all handle traffic.
It doesn't have to be like this. We can use a single domain across datacentres and still transparently ensure that users get directed to the closest machine(s) to them. Solutions are generally DNS-oriented, and include terms like GeoDNS, split horizon DNS, and DNS anycasting.
We can also retain the Geo multiple-domain-based approach and apply it to HA if we want to, by adding support in GitLab for a machine to respond to multiple domains and serve them with the same code. There's already been some conversation about doing this to ease ~"Geo DR" failovers.
We'd then have, say, gitlab.com, eu.gitlab.com and us.gitlab.com. gitlab.com would contain the IPs of all load balancers, perhaps served via anycast DNS if available, eu.gitlab.com the european ones, and us.gitlab.com the american ones. If anycast DNS isn't available to a customer, they can ask their users to select the region explicitly by DNS as they can with Geo now. If it is, they don't have to bother their users with it.
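To make the lookup behaviour concrete, here's a rough sketch of what a geo-aware resolver might hand back for each name. Everything in it is an assumption for illustration: the `REGION_RECORDS` table, the `resolve` function, and the documentation-range IPs are all made up, and it models GeoDNS-style answers rather than real anycast routing.

```python
# Hypothetical record sets; the IPs are documentation-range placeholders.
REGION_RECORDS = {
    "eu": ["192.0.2.10", "192.0.2.11"],
    "us": ["198.51.100.10", "198.51.100.11"],
}


def resolve(hostname, client_region=None, geo_aware=True):
    """Return the load-balancer IPs a resolver might answer with."""
    label = hostname.split(".")[0]
    if label in REGION_RECORDS:
        # eu.gitlab.com / us.gitlab.com: the user picked a region explicitly.
        return list(REGION_RECORDS[label])
    if geo_aware and client_region in REGION_RECORDS:
        # Bare domain with geo-aware DNS: answer with the closest region only.
        return list(REGION_RECORDS[client_region])
    # Bare domain without geo-aware DNS: answer with every load balancer.
    return [ip for region in sorted(REGION_RECORDS) for ip in REGION_RECORDS[region]]
```

With geo-aware DNS, `resolve("gitlab.com", client_region="us")` returns only the american load balancers. Without it, the bare domain returns all of them, and users who care can still pick `eu.gitlab.com` or `us.gitlab.com` by hand.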
PostgreSQL replication
We already have a PostgreSQL HA solution that we're considering switching to entirely for Geo, replacing the current "one standby secondary" model.
Multiple independent replicas of Git data on separate storage backends
Geo's approach is filesystem-level replication of everything in the background, with https://gitlab.com/gitlab-org/gitlab-ee/issues/1381 being considered in the future.
The HA approach is "redundancy in gitaly" (gitaly#843). If you need N/2+1 replicas, it's not impossible to state that at least one of those replicas should be in a separate DC before the push is "done".
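As a sketch of that rule, assuming each replica acknowledges a push and reports which datacentre it lives in: the `Ack` type and `push_is_done` function below are hypothetical and are not how gitaly does (or will do) it. They only show that "majority quorum plus at least one remote DC" is a cheap check to express.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ack:
    replica: str      # hypothetical replica identifier
    datacentre: str   # datacentre the replica lives in


def push_is_done(acks, total_replicas, origin_dc):
    """True once N/2+1 replicas have acked and at least one is outside origin_dc."""
    quorum = total_replicas // 2 + 1
    # Count each replica once, even if it acknowledged more than once.
    unique = {ack.replica: ack for ack in acks}
    has_remote = any(ack.datacentre != origin_dc for ack in unique.values())
    return len(unique) >= quorum and has_remote
```

With three replicas, two acks from the origin DC are not enough, but one ack from the origin DC plus one from another DC is.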
Redis replication
In the HA model, there's a single shared Redis cluster. In Geo, there is an independent Redis cluster per node.
We've been discussing adding Redis replication at some point. If we do that, we could benefit from exactly the same work in "HA". What breaks if we treat Geo as multi-region HA without it?
Various things are broken or degraded without it at present, notably caching.
DR
In Geo, this is the idea that all data is replicated to a physically distinct place. In HA, we divert that to gitaly redundancy and object storage, which can already replicate data without GitLab-specific code. The rest is just configuration of individual machines within the HA cluster, according to which region they're in.
In the multi-datacentre HA model, if one region completely fails, then all traffic immediately and automatically goes to the servers in the next-closest region. In the Geo model, we have to reconfigure every server in every region, including changing DNS and other slow-to-propagate things, before DR is complete.
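The multi-datacentre behaviour boils down to "walk the user's regions closest-first and take the first healthy one". A minimal sketch, assuming something (DNS health checks, a load balancer) already knows which regions are up; `pick_region` and its arguments are hypothetical names:

```python
def pick_region(preference, healthy):
    """Return the closest healthy region, or None if every region is down.

    preference: regions ordered closest-first for this user.
    healthy: set of regions currently passing health checks.
    """
    for region in preference:
        if region in healthy:
            return region
    return None
```

For a european user with preference `["eu", "us"]`, losing the whole `eu` region just means the next lookup returns `"us"`, with no per-server reconfiguration.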
What else am I missing?
Geo selective sync
This isn't a workable feature from a HA point of view, but then, I'm not aware of any customers using it at present. I recently opened an issue to do with federation: https://gitlab.com/gitlab-org/gitlab-ee/issues/4517 which might be a rough analogue to selective sync in some cases.
/cc @jramsay @ernstvn @stanhu help at refining / rubbishing this idea very welcome :)