[meta] Disaster Recovery

Disaster Recovery is Generally Available for single secondary configurations

Customers want a disaster recovery solution to prevent their organization from being severely impacted by a data center outage or some other major failure. We also want to be able to use such a solution for GitLab.com.

A key component of disaster recovery is making sure that data is replicated and current in another location that is accessible. GitLab Geo provides this foundation.

To offer a comprehensive disaster recovery solution, everything needs to be replicated and accessible. Not all of the following are required for Disaster Recovery to be GA.

  • git %10.2
  • git LFS
    • object storage gitlab-org/gitlab-ee#3944 (replicated externally)
    • local (disk, NFS etc)
  • wiki %10.2
  • database (issues, merge requests, snippets etc)
  • attachments (images on issues and merge requests)
    • object storage gitlab-org/gitlab-ee#3944 (replicated externally)
    • local (disk, NFS etc)
  • CI logs and artifacts
    • object storage gitlab-org/gitlab-ee#3944
    • local (disk, NFS etc) gitlab-org/gitlab-ee#2388
  • GitLab Pages assets (.html, .css, .js etc that will be served)
  • ElasticSearch gitlab-org/gitlab-ee#1186

Proposal

We want to offer a Disaster Recovery solution that our customers will want to buy, and that we will be able to use ourselves for GitLab.com. GitLab.com is the biggest GitLab installation that we know of, and it has its own constraints. However, we are confident that if we solve this problem for ourselves, it will benefit our customers as well: we will hit potential bugs before they do, making the product more solid.

The feature will be called Disaster Recovery, once marketed.

Enhancements

  • Planned failover process migrating between data centers (like the GCP migration)
  • Support Elasticsearch in Geo secondary nodes gitlab-org/gitlab-ee#1186
Implementation notes (Geo related, not DR)

Implementation approach

We tried the MinIO approach but realized it won't work for us for a variety of reasons. We are now investigating building our own solution.

  • Every attachment is tracked in the primary node's DB.
  • Secondary nodes have a new tracking DB.
  • Periodically check the tracking DB and find the highest updated_at timestamp.
  • Find the first X timestamps in the primary node's DB that are later than this updated_at.
  • Replicate those files and update the secondary node's tracking table once they're done.
  • Rinse and repeat.
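The loop above can be sketched as follows. This is an illustrative sketch only, not the actual GitLab implementation: the table names (`attachments`, `file_registry`), the batch size, and the `replicate_file` helper are all assumptions, and in-memory SQLite stands in for the real primary and tracking databases.

```python
import sqlite3

BATCH_SIZE = 3  # the "X" in the description; value chosen for illustration

# Hypothetical schemas: the primary DB tracks every attachment, and a
# separate tracking DB on the secondary records what has been replicated.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE attachments (id INTEGER, path TEXT, updated_at REAL)")

tracking = sqlite3.connect(":memory:")
tracking.execute("CREATE TABLE file_registry (id INTEGER, path TEXT, updated_at REAL)")

def replicate_file(path):
    pass  # stand-in for the actual file transfer to the secondary's storage

def sync_batch():
    """One iteration: find the highest replicated updated_at, then copy the
    next BATCH_SIZE files from the primary that are newer than it."""
    row = tracking.execute("SELECT MAX(updated_at) FROM file_registry").fetchone()
    cursor = row[0] if row[0] is not None else 0.0
    batch = primary.execute(
        "SELECT id, path, updated_at FROM attachments "
        "WHERE updated_at > ? ORDER BY updated_at LIMIT ?",
        (cursor, BATCH_SIZE),
    ).fetchall()
    for id_, path, ts in batch:
        replicate_file(path)
        # Update the tracking DB only once the file has been copied.
        tracking.execute("INSERT INTO file_registry VALUES (?, ?, ?)", (id_, path, ts))
    tracking.commit()
    return len(batch)

# Seed the primary with five attachments, then "rinse and repeat".
for i in range(5):
    primary.execute("INSERT INTO attachments VALUES (?, ?, ?)",
                    (i, f"/uploads/{i}", float(i + 1)))
primary.commit()

while sync_batch():
    pass

count = tracking.execute("SELECT COUNT(*) FROM file_registry").fetchone()[0]
print(count)  # all 5 attachments are now recorded as replicated
```

Note the design choice this mirrors: the secondary never needs write access to the primary's database; it only reads from it and records progress in its own tracking DB, so a failed batch is simply retried on the next pass.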
Previous releases

Version-by-version breakdown

9.0

9.1

9.2

9.3

9.4 (July)

  • Support PostgreSQL replication slots
  • Improve speed of cloning and replication (e.g. by using more parallel workers)
  • Enable Geo log cursor

9.5 (August)

10.0 (September)

10.1 (October)

https://gitlab.com/groups/gitlab-org/boards/364268?scope=all&utf8=%E2%9C%93&state=opened&milestone_title=10.1&label_name[]=Geo

Edited by Gabriel Mazetto