[meta] Disaster Recovery

Disaster Recovery is Generally Available for single secondary configurations

Customers want a disaster recovery solution to prevent their organization from being severely impacted by a data center outage or some other major failure. We also want to be able to use such a solution for GitLab.com.

A key component of disaster recovery is making sure that data is replicated and current in another location that is accessible. GitLab Geo provides this foundation.

To offer a comprehensive disaster recovery solution, everything needs to be replicated and accessible. Not all of the following are required for Disaster Recovery to be GA.

  • git %10.2
  • git LFS
    • object storage gitlab-org/gitlab-ee#3944 (replicated externally)
    • local (disk, NFS etc)
  • wiki %10.2
  • database (issues, merge requests, snippets etc)
  • attachments (images on issues and merge requests)
    • object storage gitlab-org/gitlab-ee#3944 (replicated externally)
    • local (disk, NFS etc)
  • CI logs and artifacts
    • object storage gitlab-org/gitlab-ee#3944
    • local (disk, NFS etc) gitlab-org/gitlab-ee#2388
  • GitLab Pages assets (.html, .css, .js etc that will be served)
  • ElasticSearch gitlab-org/gitlab-ee#1186

Proposal

We want to offer a Disaster Recovery solution that our customers will want to buy, and that we will be able to use ourselves for GitLab.com. GitLab.com is the biggest GitLab installation that we know of, and it has its own constraints. However, we are confident that if we solve this problem for ourselves, it will benefit our customers as well: we will hit potential bugs before they do, making the product more solid.

The feature will be called Disaster Recovery, once marketed.

Enhancements

  • Planned failover process migrating between data centers (like the GCP migration)
  • Support Elasticsearch in Geo secondary nodes gitlab-org/gitlab-ee#1186
Implementation notes (Geo related, not DR)

Implementation approach

We tried the MinIO approach but realized it won't work for us for a variety of reasons. We are now investigating building our own solution.

  • Every attachment is tracked in the primary node's DB.
  • Secondary nodes have a new tracking DB.
  • Periodically check the tracking DB and find the highest updated_at timestamp.
  • Find the first X timestamps in the primary node's DB that are later than this updated_at.
  • Replicate those files and update the secondary node's tracking table once they're done.
  • Rinse and repeat.
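The loop above can be sketched as follows. This is an illustrative sketch only, not the actual GitLab implementation: the table names (`attachments`, `file_registry`), the batch size, and the `replicate_file` helper are all assumptions, and in-memory SQLite stands in for the real primary and tracking databases.

```python
import sqlite3

BATCH_SIZE = 3  # the "X" in the description; value chosen for illustration

# Hypothetical schemas: the primary DB tracks every attachment, and a
# separate tracking DB on the secondary records what has been replicated.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE attachments (id INTEGER, path TEXT, updated_at REAL)")

tracking = sqlite3.connect(":memory:")
tracking.execute("CREATE TABLE file_registry (id INTEGER, path TEXT, updated_at REAL)")

def replicate_file(path):
    pass  # stand-in for the actual file transfer to the secondary's storage

def sync_batch():
    """One iteration: find the highest replicated updated_at, then copy the
    next BATCH_SIZE files from the primary that are newer than it."""
    row = tracking.execute("SELECT MAX(updated_at) FROM file_registry").fetchone()
    cursor = row[0] if row[0] is not None else 0.0
    batch = primary.execute(
        "SELECT id, path, updated_at FROM attachments "
        "WHERE updated_at > ? ORDER BY updated_at LIMIT ?",
        (cursor, BATCH_SIZE),
    ).fetchall()
    for id_, path, ts in batch:
        replicate_file(path)
        # Update the tracking DB only once the file has been copied.
        tracking.execute("INSERT INTO file_registry VALUES (?, ?, ?)", (id_, path, ts))
    tracking.commit()
    return len(batch)

# Seed the primary with five attachments, then "rinse and repeat".
for i in range(5):
    primary.execute("INSERT INTO attachments VALUES (?, ?, ?)",
                    (i, f"/uploads/{i}", float(i + 1)))
primary.commit()

while sync_batch():
    pass

count = tracking.execute("SELECT COUNT(*) FROM file_registry").fetchone()[0]
print(count)  # all 5 attachments are now recorded as replicated
```

Note the design choice this mirrors: the secondary never needs write access to the primary's database; it only reads from it and records progress in its own tracking DB, so a failed batch is simply retried on the next pass.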
Previous releases

Version-by-version breakdown

9.0

9.1

9.2

9.3

9.4 (July)

  • Support PostgreSQL replication slots
  • Improve speed of cloning and replication (e.g. by using more parallel workers)
  • Enable Geo log cursor

9.5 (August)

10.0 (September)

10.1 (October)

https://gitlab.com/groups/gitlab-org/boards/364268?scope=all&utf8=%E2%9C%93&state=opened&milestone_title=10.1&label_name[]=Geo

Edited by Gabriel Mazetto