Next Gen Scalable Backup and Restore
Problem to solve
Scaling
The current backup implementation relies on a GitLab rake task and does not scale well for our large customers, who have large datasets and run Gitaly Cluster on large reference architectures.
Some customers want to take daily backups of their GitLab infrastructure. The frequency may be driven by an external or internal compliance requirement. However, with hundreds of GB of repository data, several TBs of blob data, and databases exceeding 100 GB, the current solution can take more than 24 hours to complete a backup.
Multiple solutions
We offer different tools and capabilities across our reference architectures, which is confusing and frustrating for our customers. Running a backup today requires reading multiple pages of documentation, and the instructions present a decision tree of choices for which the customer may not know the answer or may lack the knowledge required to decide.
Adding new data types
As GitLab evolves, we will add new data components. Each data component will potentially require a new bespoke effort to add it to backups, re-implemented for each of the installation types we support.
Incorporating these new data components into the backups should be easy and they should work across all supported installation types.
Intermediary caching and processing data
Most customers with large GitLab instances leverage object storage as a cost-effective, scalable file storage solution. Our current backup solution copies these files to a local machine for composing the backup which subsequently gets uploaded back to object storage. Since this can involve terabytes of data the time and costs involved can be significant. It has also proven to be unreliable in some cases.
We need to have a scalable backup solution that can help our customers protect their data from unintentional as well as malicious corruption.
Intended users
Sidney (Systems Administrator)
User experience goal
The system admin will be able to take on-demand as well as scheduled backups of their GitLab instance.
The early solution will focus on cloud-hosted GitLab installations. It will package the GitLab data on object storage in such a way that the vendor’s tools can take over for long-term storage. The user is expected to drive the vendor’s UI or APIs to move the backup to their preferred location. Similarly for restoration, the user is expected to retrieve and make available the backup data. In instances where the vendor features were used to backup data such as DB or object storage buckets, the user is expected to restore these to operational instances that can be accessed by the GitLab instance onto which the data is being restored and the backup solution.
The interface to this solution will be CLI-based. It will have a unified CLI that works across all the different types of GitLab installations.
As the solution evolves, we will explore ways to automate the vendor-side functions by talking to the respective APIs, thus allowing the user to complete an end-to-end backup directly through the tool, with the resulting backup stored at a specified location. Similarly, restoration will be automated via integration with the vendor's APIs.
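The unified CLI surface described above could be sketched as follows. This is a hypothetical illustration only: the program name, subcommands, and flags are assumptions, not the final interface.

```python
# Hypothetical sketch of a unified backup/restore CLI surface.
# All command and flag names here are illustrative assumptions.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="gitlab-backup")
    sub = parser.add_subparsers(dest="command", required=True)

    backup = sub.add_parser("backup", help="Create a backup of the instance")
    backup.add_argument("--config", required=True,
                        help="Path to the backup configuration file")
    backup.add_argument("--type", choices=["cloud", "portable"],
                        default="cloud",
                        help="Backup format (see Cloud vs Portable backups)")

    restore = sub.add_parser("restore", help="Restore a backup to the instance")
    restore.add_argument("--config", required=True,
                         help="Path to the restore configuration file")
    restore.add_argument("--backup-id", required=True,
                         help="Identifier of the backup to restore")
    return parser
```

The same parser would back every installation type, so documentation and muscle memory carry over between, for example, a Linux-packaged and a Cloud Native Hybrid deployment.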
Proposal
A GitLab backup solution should cover the following GitLab data:
- Primary data (must have):
- PostgreSQL data
- Git Data
- Files
- Secondary data (nice to have):
- GitLab Configuration
- GitLab secrets
- Redis data
Cloud vs Portable backups
We define two broad conceptual backups formats:
- Cloud Backups - Only applicable for deployments hosted in the cloud. These backups rely on the solutions each cloud vendor provides for creating and storing backups of each type of data. The data is stored in a format proprietary to the cloud vendor, so these backups are only viable inside that vendor's environment. The vendors' solutions have been optimized for performance and cost within their environments and are often a good choice for deployments that stay within a single vendor.
- Portable Backups - Applicable to all environments: self-hosted, air-gapped, and cloud-hosted. The data is stored in portable formats, allowing the backup to be restored to a different environment from the one in which it was created (for example, a portable backup from a self-hosted environment can be restored to a cloud-hosted environment and vice versa). The nature of the tooling available often means these backups are less scalable than the solutions cloud vendors offer within their own environments.
There will be instances of mixed backups, where the backup contains a mixture of cloud vendor-specific data and portable data. For example, a customer running Linux-packaged PostgreSQL in a cloud environment would combine a portable database backup with cloud-native backups of object storage.
Principles
- Single solution: The solution will have a single unified CLI that works across all the different types of GitLab installations - Linux package, Docker, Cloud Native Hybrid, and GDK - and will eventually handle backups for GitLab.com and GitLab Dedicated as well as self-managed installations.
- Minimize copying data to temporary locations: The solution will minimize the copying of data into temporary locations for intermediate processing and instead strive to copy the relevant data directly to its final destination.
- Easily extendable to support new data types: We will build a self-service framework to reduce the complexity of adding new data components and maximize the utility of shared code and logic, similar to the Geo self-service framework. It will be easy to add new components, allowing the backups to keep up with new data elements created as GitLab evolves, thus protecting against data loss as a result of incomplete backups. In due course, we will provide guidance, a step-by-step process, and templates for adding new data types, as we have done for Geo.
- Co-exist with existing solutions: The proposed backup solution will exist alongside the existing backup solutions. It is not intended to entirely replace them at this stage, but instead to provide a path forward for customers on large architectures with large data sets.
- Scalable: The new backup solution should scale to support our largest reference architectures.
- Leverage capabilities available on major cloud vendor platforms where possible: The backup solution will work closely with the data-protection capabilities already available on major cloud platforms such as AWS, Azure, and GCP. We will work with the lowest common denominator of features across these vendors; for example, all vendors support object storage with versioning and managed databases.
- Will not provision infrastructure: The backup solution will not be responsible for provisioning infrastructure. It will be able to restore backup data to an existing GitLab instance or to a newly provisioned one; provisioning a new GitLab instance is outside the scope of the backup solution.
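The "easily extendable" principle above could take the shape of a registration pattern: each data type implements a small interface and registers itself, so the core backup loop never changes when a component is added. The class and function names below are hypothetical, loosely inspired by the Geo self-service framework idea rather than any existing GitLab API.

```python
# Illustrative sketch of a self-service registration pattern for backup
# data types. All names here are assumptions for illustration.
from abc import ABC, abstractmethod

# Global registry mapping a data-type name to its handler class.
REGISTRY: dict[str, type] = {}


def register(name: str):
    """Class decorator that adds a backup target to the registry."""
    def wrapper(cls):
        REGISTRY[name] = cls
        return cls
    return wrapper


class BackupTarget(ABC):
    """Interface every data component implements to join backups."""

    @abstractmethod
    def backup(self, destination: str) -> None: ...

    @abstractmethod
    def restore(self, source: str) -> None: ...


@register("postgresql")
class PostgresTarget(BackupTarget):
    def backup(self, destination: str) -> None:
        print(f"streaming database dump to {destination}")

    def restore(self, source: str) -> None:
        print(f"restoring database from {source}")


def run_backup(destination: str) -> list[str]:
    """Back up every registered data type; adding a new type
    requires no changes to this loop."""
    completed = []
    for name, target_cls in REGISTRY.items():
        target_cls().backup(f"{destination}/{name}")
        completed.append(name)
    return completed
```

A new data component would then be a single decorated class plus its tests, which is the kind of templated, step-by-step addition the principle calls for.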
Development of this solution will be done iteratively. We will start with Cloud backups and target the Cells architecture on GCP for our first milestone. We will then evolve the Cloud Backup solution to add support for other major cloud vendors and new capabilities before shifting our focus to portable backups.
Milestone 1 - Cloud backups for Cells architecture on GCP
The first milestone will focus on delivering a backup solution for the GitLab Cells architecture. Cells will be deployed on GCP with CloudSQL; therefore, our focus will be on the GCP platform and deployment model.
We will focus on the primary data types (database, git, files) only for this milestone.
The solution will work synergistically with the backup services available on the GCP platform such as CloudSQL backup, storage transfer service and disk snapshots.
The backup tool will expect the required infrastructure where the backup data will be stored, such as buckets and service accounts, to already exist. The tool will be told the location and authentication method via configuration. The infrastructure will either be created manually by systems administrators or managed by a tool such as GET.
Similarly, restoration requires the backup tool to be told the location of the backed-up data and how to access it via configuration. The tool will coordinate the unpacking of different data components to restore the data to a working GitLab instance.
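The configuration-driven restore flow described above might look like the sketch below: the tool receives, per data component, a location and an authentication method, and produces an ordered restore plan. The field names and the database-first ordering are assumptions for illustration, not a specification.

```python
# Minimal sketch of configuration-driven restore coordination.
# Field names and ordering are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ComponentLocation:
    component: str    # e.g. "database", "git", "files"
    location: str     # e.g. a pre-existing bucket URI
    auth_method: str  # e.g. "service_account"


def plan_restore(config: list[ComponentLocation]) -> list[str]:
    """Produce an ordered list of restore steps. The database is
    restored first here (an assumed ordering), so that repository
    and file references in it point at data restored afterwards."""
    order = {"database": 0, "git": 1, "files": 2}
    steps = sorted(config, key=lambda c: order.get(c.component, 99))
    return [f"restore {c.component} from {c.location} via {c.auth_method}"
            for c in steps]
```

The tool itself never provisions the locations it is pointed at, matching the "will not provision infrastructure" principle.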
Future iterations
Lower RPO with PIT
Medium-term, we will support point-in-time (PIT) recovery for managed Postgres databases, Gitaly, and object storage. Initially, this will take the form of documenting how to configure and restore PIT for managed Postgres and object storage via the vendor's interface, combined with how to drive the GitLab tools for Gitaly. This capability will come with caveats and requirements, such as a lack of cross-region support and additional storage.
Integrate Gitaly server-side backups
The solution will integrate Gitaly server-side backups for repositories. Gitaly server-side backups provide a scalable solution for backing up repository data. The proposed solution coordinates and collates this data together with files and DB to form a single coherent backup that can be stored for later retrieval.
Incremental backups
Incremental backups will allow faster, more frequent backups and save storage by backing up only the changes since the last full backup. The solution will need to support incremental backups for all three major data components - database, repositories, and files.
Add support for AWS and Azure
Air-gapped and local file storage customers
While the initial focus will be on cloud-hosted installations, longer-term we will add support for air-gapped and local file storage customers as well as those using minio and object storage appliances.
Further details
We will start by building the unified CLI targeting the 1K reference architecture to establish the foundations, then expand to larger architectures in subsequent iterations, with the ultimate goal of supporting all GitLab reference architectures.
We also aim for the new backup solution to take over backups on GitLab Dedicated and longer term GitLab.com on top of the cells architecture.
Permissions and Security
Backup and restore are administrator-level functions. Therefore, using the tool will require administrator-level access to the GitLab installation and command-line access to one of the Rails nodes in the deployment.
Additional permissions and roles will be considered in the future.
Documentation
Availability & Testing
Available Tier
- Free
- Premium
- Ultimate
Feature Usage Metrics
- Number of successful backups each month per installation (incremental and full)
- Number of failed backups each month per installation (incremental and full)
- Average time for backup completion - Since one of the primary goals of this solution is to scale we want to measure the average time taken for backups to complete by each installation broken down by type of backup - incremental and full.
What does success look like, and how can we measure that?
- We see a decrease in the number of support tickets related to backups - especially around scaling.
- GitLab Dedicated successfully uses this solution as its de facto backup solution.
What is the type of buyer?
Is this a cross-stage feature?
No
What is the competitive advantage or differentiation for this feature?
Links / references
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.