Gitaly Replicas based on Geo replication

This is a follow-up on the discussion here: https://gitlab.slack.com/archives/C32LCGC1H/p1540571721004700

Introduction

This is a proposal for a simple version of Gitaly HA, based on an extraction of Geo (a.k.a. a simple version of "GitLab Spokes").

In previous Geo discussions, we explored the idea of implementing something similar to GitHub Spokes, using a consensus algorithm based on either Paxos or Raft.

That would require new functionality at the Git layer: taking a partial lock and replicating the commit to a majority of replicas before ACKing back to the user.

While this is still an interesting idea, we can try something much simpler that would still improve our current situation.

Right now, the only way to have a replica of a repository is by using Geo. While Geo provides that, it also provides, and cares about, much more than just the repositories.

There is also the inconvenience of having multiple URLs (which makes sense for distinct geographical locations, but not when you are trying to split the load within a single location).

So there is room for a middle-ground solution: one that leverages the repository replication component of Geo and ignores all the rest.

Use-case example

Setup:

  1. Primary Geo node: US-East
  • 2 additional Gitaly replicas for each "shard" (giving 1 writable replica and 2 read-only replicas)
  • PostgreSQL with HA (Consul)
  • Redis with Sentinel
  • Multiple gitlab-rails machines
  • Multiple sidekiq-worker machines
  2. Secondary Geo node: US-West
  • 1 additional Gitaly replica (giving 1 writable replica and 1 read-only replica)
  • PostgreSQL with HA (Consul)
  • Redis with Sentinel
  • Multiple gitlab-rails machines
  • Multiple sidekiq-worker machines

In this initial setup, Gitaly replicas will be used only to split the load and for HA with limited capability (you can only degrade to a read-only replica if it was in sync with the primary). Because there is no requirement to commit to more than one replica in this version, we can't automatically fail over in every situation.
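As a rough illustration of that constraint, the failover decision could look like the following minimal Go sketch (the `Replica` type and its fields are hypothetical, not existing GitLab code):

```go
package failover

// Replica describes one Gitaly replica and its last known replication
// state. This type is hypothetical, for illustration only.
type Replica struct {
	Address  string
	Writable bool
	InSync   bool // true if the replica had caught up with the primary
}

// ReadOnlyFallback returns a replica that may safely serve reads after
// the writable replica is lost. Without multi-write, a replica that
// lagged behind could serve stale or missing refs, so we only ever
// degrade to replicas that were in sync at the time of the failure.
func ReadOnlyFallback(replicas []Replica) (Replica, bool) {
	for _, r := range replicas {
		if !r.Writable && r.InSync {
			return r, true
		}
	}
	return Replica{}, false // no safe fallback: the repository is unavailable
}
```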

Pros and Cons

This suggestion allows implementing multi-region setups (Geo locations) with HA in each of them, including for the repositories. It also allows implementing HA for the repositories without using Geo.

There is no Raft/Paxos, so the implementation is much simpler. There is also no replication happening in the same "push" transaction, no branch locking, etc.

Raft/Paxos is still desirable as a future iteration, as it reduces the possibility of data loss if there is a disaster situation on the Gitaly primary.

Because there is no multi-write, there is no additional latency when pushing code. Only one machine needs to "ACK".

We still get a way to split the load and to survive a disaster with minimal to no loss.

How to implement

There are a few components in this proposal:

  1. The replication mechanism
  2. The coordination/state tracking
  3. The load balancing

The load balancer is the easiest one. There is almost no need to change anything on the two services that handle git operations coming from the client: gitlab-workhorse and gitlab-shell.

Whenever there is a git operation, it will hit one of the two. They ping the internal API for authentication and to retrieve the Gitaly server they need to communicate with.

All we need to do here is extend the API to optionally provide more than one Gitaly endpoint, annotating each as read-only or read-write, so gitlab-workhorse or gitlab-shell can decide which one to use.
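For illustration, the extended payload could be decoded like this on the workhorse/shell side (field names and the endpoint address are assumptions for this sketch, not the actual internal API):

```go
package internalapi

import "encoding/json"

// GitalyEndpoint is a hypothetical extension of the internal API
// response consumed by gitlab-workhorse and gitlab-shell. Today the
// API returns a single Gitaly address; here it returns a list, each
// entry annotated with its role. Field names are illustrative only.
type GitalyEndpoint struct {
	Address  string `json:"address"`   // e.g. "tcp://gitaly-1.us-east:9999"
	ReadOnly bool   `json:"read_only"` // true for replicas that must deny pushes
	InSync   bool   `json:"in_sync"`   // replication state from the tracking database
}

// AuthResponse is the part of the authentication payload that carries
// the Gitaly endpoints (again, the name is an assumption).
type AuthResponse struct {
	GitalyEndpoints []GitalyEndpoint `json:"gitaly_endpoints"`
}

// ParseAuthResponse decodes the payload returned by the internal API.
func ParseAuthResponse(body []byte) (*AuthResponse, error) {
	var resp AuthResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	return &resp, nil
}
```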

When the system is healthy and everything is in sync, the API will always return both. If the writable replica is down and the repository on the read-only replica is still in sync, that replica can still be used for pulling data, but it will deny a push operation.
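Continuing the previous sketch (same hypothetical `GitalyEndpoint` type), the selection rule in gitlab-workhorse or gitlab-shell might look like this:

```go
package internalapi

import "errors"

var ErrNoEndpoint = errors.New("no Gitaly endpoint can serve this operation")

// PickEndpoint chooses which Gitaly server should handle a git
// operation, following the rules above: pushes must go to the writable
// replica and are denied when it is down; pulls prefer an in-sync
// read-only replica to split the load, falling back to the writable one.
func PickEndpoint(endpoints []GitalyEndpoint, write bool) (GitalyEndpoint, error) {
	if write {
		for _, e := range endpoints {
			if !e.ReadOnly {
				return e, nil // pushes only ever go to the writable replica
			}
		}
		return GitalyEndpoint{}, ErrNoEndpoint // writable down: deny the push
	}
	for _, e := range endpoints {
		if e.ReadOnly && e.InSync {
			return e, nil // in-sync read-only replica serves the pull
		}
	}
	for _, e := range endpoints {
		if !e.ReadOnly {
			return e, nil // no usable replica: the writable one serves reads too
		}
	}
	return GitalyEndpoint{}, ErrNoEndpoint
}
```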

The replication mechanism should be very similar to how Geo works: a Gitaly replica will clone from the primary replica whenever there is an update, via Sidekiq jobs (as we do for Geo). That Sidekiq job will update a simpler version of the Tracking Database.
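In GitLab this would run as a Sidekiq job in Ruby, as Geo does today; the Go sketch below only illustrates the shape of the work, and the `repository_states` table it touches is a hypothetical schema (sketched further below), not the real Geo tracking schema:

```go
package replication

import (
	"database/sql"
	"fmt"
	"os/exec"
)

// ReplicateRepository mirrors one repository from the writable Gitaly
// replica onto this read-only replica, then records the new state in
// the tracking database so the internal API can advertise the replica
// for read operations.
func ReplicateRepository(db *sql.DB, repoPath, replicaAddr, primaryURL string) error {
	// Fetch all refs from the primary; a prune + mirror-style refspec
	// keeps the replica identical to the writable copy.
	cmd := exec.Command("git", "-C", repoPath, "fetch", "--prune", primaryURL,
		"+refs/*:refs/*")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("fetch from primary failed: %v: %s", err, out)
	}

	// Mark this replica's copy of the repository as in sync.
	_, err := db.Exec(
		`UPDATE repository_states
		    SET in_sync = true, last_synced_at = now()
		  WHERE repository_path = $1 AND replica_address = $2`,
		repoPath, replicaAddr)
	return err
}
```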

The Tracking Database will be used even when Geo is not enabled, as long as Gitaly replicas are used, so replication state is stored separately from the main GitLab database.

By separating it into a distinct database, we reduce the load on the main one, and we can use the same mechanism for a Geo primary or a Geo secondary.
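A minimal, hypothetical schema for this standalone tracking database could be as small as one row per repository per replica (the real Geo tracking database is much richer; this only covers what the Gitaly-replica use case needs):

```go
package replication

import "database/sql"

// trackingSchema is an illustrative schema: it records, for each
// repository on each replica, whether that copy is in sync and when
// it was last synced. Table and column names are assumptions.
const trackingSchema = `
CREATE TABLE IF NOT EXISTS repository_states (
    repository_path text NOT NULL,
    replica_address text NOT NULL,
    in_sync         boolean NOT NULL DEFAULT false,
    last_synced_at  timestamptz,
    PRIMARY KEY (repository_path, replica_address)
);`

// EnsureSchema creates the tracking table if it does not exist yet.
func EnsureSchema(db *sql.DB) error {
	_, err := db.Exec(trackingSchema)
	return err
}
```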

So if Geo is enabled, the secondary doesn't need to receive Gitaly replication data from the primary; it will only receive "Geo replication events".

This is a simple and boring evolution. There are lots of optimization opportunities, but I believe this is something that can be delivered in 2-4 release cycles.
