Exploring the DR flow on GitLab.com
Introduction
We are addressing a regional, catastrophic outage: the entire region is offline, or so degraded that it is unclear when service will recover. Our RTO targets cannot be met by simply waiting for the region to come back.
The following is based on a discussion with @brentnewton and @hphilipps to determine possible recovery flows.
Requirements
- RTO target for Premium+ customers: 1 hour.
- RPO target for Premium+ customers: 10 minutes.
- Free users are recovered on a best-effort basis, but we can't lose their data. A 1-hour RPO may be acceptable.
- The cost of the DR site must be controlled; it is largely driven by Free user Git data.
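The targets above could be encoded so that automation or dashboards can check observed replication lag against them. A minimal sketch (names and the structure are hypothetical; the figures come from the requirements above):

```python
from datetime import timedelta

# Recovery targets per customer tier, taken from the requirements above.
# The Free-tier RPO of 1 hour is tentative ("maybe acceptable").
TARGETS = {
    "premium_plus": {"rto": timedelta(hours=1), "rpo": timedelta(minutes=10)},
    "free": {"rto": None, "rpo": timedelta(hours=1)},  # RTO is best-effort
}

def meets_rpo(tier: str, replication_lag: timedelta) -> bool:
    """Return True if the observed replication lag is within the tier's RPO."""
    return replication_lag <= TARGETS[tier]["rpo"]

# Example: a 5-minute lag satisfies the Premium+ RPO; 15 minutes does not.
print(meets_rpo("premium_plus", timedelta(minutes=5)))   # True
print(meets_rpo("premium_plus", timedelta(minutes=15)))  # False
```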
DR site overview
We focused on four high-level components:
- Stateless services - API nodes, Web nodes, etc.
- PostgreSQL Database ("Patroni")
- Data in Object Storage
- Gitaly and Git data
```mermaid
graph LR
patroni1 -- "WAL shipping or streaming replication" --> patroni2
oo1 -- "Cross-region replication" --> oo2
git1 -- "???" --> git2
subgraph "gitlab.com"
stateless1[Stateless services]
patroni1[Database nodes]
oo1[Object storage buckets]
git1[Git data]
end
subgraph "dr.gitlab.com"
stateless2[Stateless services]
patroni2[Database nodes]
oo2[Object storage buckets]
git2[Git data]
end
```
- Stateless services can be scaled down on dr.gitlab.com
- Patroni nodes can be scaled down somewhat
- Git data is 95% from Free users; the means of replication is TBD
How to replicate Git data?
The main constraint on the above is that Free user Git data is the main cost driver. Git data is kept on SSDs, and replicating all of it (e.g. via Geo) would roughly double that cost. Below are some proposals to address this.
Git data replication - GCP disk snapshots only
We already take regular snapshots of the SSDs that store Git data. At its simplest, we could rely only on those snapshots. The maximum snapshot frequency is 10 minutes, which is barely within our RPO target.
```mermaid
graph LR
git1 -- "GCP disk snapshots - every 10 mins" --> git2
subgraph "gitlab.com"
git1[Git data]
end
subgraph "dr.gitlab.com"
git2[Git data]
end
```
- Cost of snapshots is not yet known
- No separation between Premium+ and Free
- Maximum data loss for Premium+ is 10 minutes
- We would need to restore all disk snapshots and bring up Gitaly nodes; this may exceed our RTO (needs testing)
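To make the RTO risk above concrete, a back-of-the-envelope restore-time model can be sketched. All figures here are illustrative assumptions, not measurements, and the parallel-batch model is deliberately simplistic:

```python
from datetime import timedelta

def snapshot_restore_time(num_disks: int,
                          per_disk_restore: timedelta,
                          parallelism: int) -> timedelta:
    """Rough estimate: disks are restored in parallel batches of `parallelism`."""
    batches = -(-num_disks // parallelism)  # ceiling division
    return batches * per_disk_restore

# Illustrative figures only: 600 Gitaly disks, 10 minutes per restore,
# 100 concurrent restore operations.
estimate = snapshot_restore_time(600, timedelta(minutes=10), 100)
print(estimate)                        # 1:00:00
print(estimate <= timedelta(hours=1))  # True - but leaves zero headroom
```

Even under generous assumptions the snapshot-only flow consumes the entire Premium+ RTO budget on Gitaly restores alone, which supports the "need to test" caveat.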
Git data replication - GCP disk snapshots for Free, continuous replication for Premium+
This approach would use GCP disk snapshots for Free user Git data, likely at an hourly cadence. We would separate Premium+ data onto dedicated Gitaly nodes, still snapshot those disks as a backup (e.g. every 10 minutes), but rely on some other, more continuous mechanism (e.g. Geo or another technology) to replicate the Git data.
```mermaid
graph LR
git1 -- "GCP disk snapshots - every hour" --> git2
git1 -- "Continuous replication" --> git3
subgraph "gitlab.com"
git1[Git data]
end
subgraph "dr.gitlab.com"
git2[Free user Git data]
git3[Premium+ Git data]
end
```
- Cost of GCP disk snapshots is not yet known
- Premium+ data would be as up to date as possible (ideally less than a minute behind), well within our RPO target. Maximum data loss is still 10 minutes via the backup snapshots
- Premium+ data would not need to be restored from snapshots, so service can resume faster
- Free data would need to be restored later on
- We need a way to restrict GitLab access to Premium+ users
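The separation this approach depends on could look like plan-aware shard routing: Premium+ repositories pinned to continuously replicated Gitaly storages, Free repositories on snapshot-only storages. A hypothetical sketch (storage names and the plan check are invented; GitLab does not have this routing today, per the product gaps below):

```python
# Hypothetical shard routing: Premium+ namespaces go to dedicated,
# continuously replicated Gitaly storages; Free namespaces go to
# snapshot-only storages. All names are invented for illustration.
PREMIUM_STORAGES = ["gitaly-premium-01", "gitaly-premium-02"]
FREE_STORAGES = ["gitaly-free-01", "gitaly-free-02", "gitaly-free-03"]

def pick_storage(namespace_id: int, plan: str) -> str:
    pool = PREMIUM_STORAGES if plan in ("premium", "ultimate") else FREE_STORAGES
    # Stable assignment: a namespace always maps to the same storage.
    return pool[namespace_id % len(pool)]

print(pick_storage(42, "premium"))  # gitaly-premium-01
print(pick_storage(42, "free"))     # gitaly-free-01
```

A real implementation would also have to handle plan upgrades/downgrades, which imply moving repositories between pools.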
Git data replication - Gitaly uses Object Storage (hypothetical)
If Gitaly had the capability to store data in Object Storage, this problem would essentially disappear: we could rely on cross-region bucket replication to replicate the data.
Known product gaps
- We have no way to restrict logins to Premium+ users only
- We can select storage shards for Geo, but AFAIK we don't have the ability to separate Premium+ Git data onto specific nodes and disks
- We have no automated promotion process for Geo
- We can't easily route traffic to another site
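The first gap - restricting logins to Premium+ users during failover - could eventually be a simple application-level switch. A hypothetical sketch of what that missing capability might look like (no such flag exists in GitLab today, which is exactly the gap):

```python
# Sketch of the missing "Premium+-only logins" capability as an
# application check. The flag and plan names are hypothetical.
DR_PREMIUM_ONLY_MODE = True  # would be toggled on during failover

def login_allowed(user_plan: str) -> bool:
    """During DR, only paid Premium+ tiers may log in."""
    if not DR_PREMIUM_ONLY_MODE:
        return True
    return user_plan in ("premium", "ultimate")

print(login_allowed("ultimate"))  # True
print(login_allowed("free"))      # False
```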
Testing DR procedures
@hphilipps had a few ideas:
- Ideally we could route a small percentage of users to a secondary site
- We can perform bubble tests to exercise the entire process (promote the DR site while it is isolated from the primary)
Open questions
- How do we replicate Redis? Do we need to care about Sidekiq queues?
- There are many small other DR steps that would need to be automated - what are they?
- How do we handle a situation in which the database is more up to date than the Git data for Free users?
- How do we handle a failback? Do we need to?
Failover flow overview
If we choose "Git data replication - GCP disk snapshots for Free, continuous replication for Premium+", this is how the entire DR process could look. I am assuming a full outage for now and am still omitting many details.
- Determine that a service outage has occurred and that recovery is not possible in place without violating RPO/RTO targets
- Inform all users about initiating DR procedures and what to expect
- Prevent the primary site from starting if at all possible (split brain prevention)
- Scale up all stateless services to a desired level
- Promote database cluster
- Scale up database cluster
- Determine that all Git data for Premium+ users is available (otherwise resort to disc snapshots)
- Lock out all Free users
- Reconfigure dr.gitlab.com to become read-writable
- Other DR steps, e.g. DNS repointing
- Allow Premium+ users to log in
- Service should be restored for Premium+ users; Free users are locked out <------ Must be within one hour
- Scale up Gitaly nodes and restore Free user Git data from snapshots
- Allow Free users to log in
- All services are operational
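The ordering above can be condensed into a two-phase runbook sketch, with the Free-user restore deliberately sequenced after Premium+ service is back, since only the first phase must finish within the one-hour RTO. Step names are hypothetical labels for the steps listed above:

```python
# Condensed sketch of the failover ordering above. Only PREMIUM_PHASE
# must complete within the one-hour RTO; FREE_PHASE follows afterwards.
PREMIUM_PHASE = [
    "confirm_outage",
    "notify_users",
    "fence_primary",             # split-brain prevention
    "scale_stateless_services",
    "promote_and_scale_database",
    "verify_premium_git_data",   # fall back to disk snapshots if this fails
    "lock_out_free_users",
    "make_dr_site_read_writable",
    "repoint_dns",
    "unlock_premium_users",
]
FREE_PHASE = [
    "restore_free_snapshots",
    "unlock_free_users",
]

def failover_plan(premium_git_data_ok: bool) -> list[str]:
    plan = list(PREMIUM_PHASE)
    if not premium_git_data_ok:
        # Continuous replication failed: restore Premium+ data from the
        # backup snapshots before unlocking Premium+ users.
        plan.insert(plan.index("unlock_premium_users"),
                    "restore_premium_snapshots")
    return plan + FREE_PHASE

print(failover_plan(True)[-1])  # unlock_free_users
```

Encoding the flow this way also gives a natural place to hang the automation and bubble-test hooks discussed earlier.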