This document is a work-in-progress and proposes architecture changes for the GitLab.com SaaS.
The goal of these changes is to maintain GitLab.com service continuity in the case of a regional or zonal outage.
For the current state see [Disaster Recovery Policies for GitLab Backups](/handbook/engineering/gitlab-com/policies/backup/#disaster-recovery).
- A **zonal recovery** is required when all resources are unavailable in one of the three availability zones in `us-east1` or `us-central1`.
- A **regional recovery** is required when all resources become unavailable in one of the regions critical to operation of GitLab.com, either `us-east1` or `us-central1`.
| Role | Responsibility |
|-----------|-----------|
| GitLab Team Members | Ensure adherence to the requirements outlined in this policy |
| Engineering (Code Owners) | Approve significant changes and exceptions to this policy |
## Procedure
GitLab defines:
- The services requiring backups
- The frequency of backups, data retention periods, and restoration processes
- The procedures for data restoration for Disaster Recovery scenarios
### Backup and Restore
#### PostgreSQL Databases
GitLab.com database backups are taken every 24 hours, with incremental updates every 60 seconds. The data is securely streamed to [GCS](https://cloud.google.com/storage), encrypted, and retained for 90 days. The CustomersDot database is backed up daily with a retention period of 7 days.
All databases are continuously monitored to ensure successful backups, with alerts triggered for missing backups. Restore processes are validated through regular restoration from disk snapshots and replaying WAL segments.
#### Git Repositories
Repositories are backed up hourly using block-level disk snapshots. These snapshots are stored in multi-region object storage and retained for 14 days. All disks are monitored, with alerts triggered for missing snapshots.
Restore validation is conducted by randomly sampling disks and restoring recent snapshots.
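Hourly block-level snapshots like these can be driven by a GCE snapshot schedule. A minimal sketch, assuming illustrative names (`gitaly-hourly` and `gitaly-data-disk` are made up; the real policy, disk names, and zones may differ):

```shell
# Hypothetical sketch: hourly disk snapshots with 14-day retention,
# stored in a multi-region location.
gcloud compute resource-policies create snapshot-schedule gitaly-hourly \
  --region=us-east1 \
  --hourly-schedule=1 \
  --start-time=00:00 \
  --max-retention-days=14 \
  --storage-location=us

# Attach the schedule to a Gitaly data disk (name and zone are illustrative).
gcloud compute disks add-resource-policies gitaly-data-disk \
  --resource-policies=gitaly-hourly \
  --zone=us-east1-b
```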
#### Object Storage
Data stored in Object Storage (GCS) benefits from Google's [99.999999999% annual durability](https://cloud.google.com/storage/docs/storage-classes#descriptions) and multi-region bucket redundancy. To enhance data protection, [Object Versioning](https://cloud.google.com/storage/docs/object-versioning) and [Soft Delete](https://cloud.google.com/storage/docs/soft-delete) are enabled.
Automated restore validation is not required for Object Storage due to its inherent protections through versioning and soft delete.
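For reference, both protections can be enabled per bucket with `gcloud`. A sketch with a made-up bucket name and an illustrative 7-day soft-delete retention (the document does not state the actual retention used):

```shell
# Hypothetical sketch: enable Object Versioning and soft delete on a bucket.
gcloud storage buckets update gs://example-backups --versioning
gcloud storage buckets update gs://example-backups --soft-delete-duration=7d
```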
### Disaster Recovery
For disaster recovery, backups are validated through periodic restoration exercises called "Game Days" to ensure compliance with recovery time objective (RTO) and recovery point objective (RPO) targets.
GitLab.com is deployed in the `us-east1` region across multiple GCP availability zones.
For short-term outages affecting a single zone within `us-east1`, unaffected zones will scale to restore service.
For the Gitaly service, recovery from backups will be necessary if data loss occurs.
Disaster recovery operations adhere to the [Disaster Recovery runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs/disaster-recovery).
These procedures target specific services to allow parallelized recovery efforts.
Mock disaster recovery (DR) events are conducted quarterly to simulate incidents affecting one or more services.
These exercises validate DR processes and readiness for real incidents.
During these [Game Days](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/disaster-recovery/gameday.md), RTO and RPO targets are validated by [recording measurements for each procedure](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/disaster-recovery/recovery-measurements.md).
## Exceptions
Exceptions to this policy will be managed in accordance with the [Information Security Policy Exception Management Process](/handbook/security/controlled-document-procedure/#exceptions).
On a daily schedule, a fresh database GCE instance is created that
restores from the latest backup and is configured as an archive replica
recovering from the WAL archive (essentially performing PITR). Once this
is complete, the restored database is verified.
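These steps can be sketched as follows, assuming a WAL-G-based setup (paths and names are illustrative, not the actual pipeline configuration):

```shell
# Hypothetical sketch of the daily restore-verification steps.
# 1. Fetch the latest base backup from object storage into an empty data dir.
wal-g backup-fetch /var/lib/postgresql/data LATEST

# 2. Configure archive recovery so the instance replays WAL from the archive
#    (PITR), e.g. in postgresql.conf:
#      restore_command = 'wal-g wal-fetch "%f" "%p"'

# 3. Start PostgreSQL, wait for replay to catch up, then verify the restored
#    database (e.g. run sanity queries against known tables).
```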
There is monitoring in place to detect problems with the restore
pipeline (currently using [deadmanssnitch.com](https://deadmanssnitch.com)).
We plan to monitor the time it takes to recover and other metrics soon.
### Disaster Recovery Replicas
The backup strategy above is a *cold backup*. To restore from a cold
backup, we need to retrieve the full backup from cold storage (over the
network) and perform PITR from it. This can take considerable time given
the amount of data that has to travel over the network.
The current speed of restoring a cold backup from AWS S3 is about 380
GB per hour (net size) for retrieving the base backup. With a database
size of currently 2.1 TB, retrieving the base backup alone takes more
than 5 hours. The PITR phase is generally even slower.
We currently aim at a `DB-DR-TTR` of 8 hours to recover from a backup.
We're *not there yet* and as an interim measure, we introduce disaster
recovery replicas.
#### Delayed Replica
Another option is to have a replica in place that always lags a few
hours behind the production cluster. We call this a *delayed replica*: It
is a normal streaming replica but delayed by a few hours. In case
disaster strikes, it can be used to quickly perform PITR from the WAL
archive. This is much faster than a full restore because we do not have
to retrieve a full backup from S3 first. Additionally, with daily
snapshots the latest snapshot is, in the worst case, 24 hours old (plus
the time it took to capture it). A delayed replica is constantly kept at
a fixed offset behind the production cluster and therefore never has to
replay many hours' worth of data.
* Production host: `postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal`
* Chef role: `gprd-base-db-postgres-delayed`
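In PostgreSQL, such a delayed replica is configured with a single setting. A sketch (the 8-hour value is illustrative, chosen only to match the `DB-DR-TTR` target above; the actual production delay may differ):

```ini
# postgresql.conf on the delayed replica (PostgreSQL 12+; older versions
# place this in recovery.conf) -- value is illustrative
recovery_min_apply_delay = '8h'
```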
#### Archive Replica
Another type of replica is an *archive replica*. Its sole purpose is to
continuously recover from the WAL archive and hence *test* the WAL
archive. This is necessary because PITR relies on a continuous sequence
of WAL that can be applied to a snapshot of the database (basebackup).
If that sequence is broken for whatever reason, PITR can only recover
until that point and no further. We monitor the replication lag of the
archive replica; if it falls too far behind, there is likely a problem
with the WAL archive.
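That lag can be measured on the replica itself with standard PostgreSQL functions; a minimal query sketch:

```sql
-- On the archive replica: how far behind the last replayed transaction is.
-- A steadily growing value suggests a gap or stall in the WAL archive.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```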
The restore testing pipeline also performs PITR from the WAL archive and
would thus also detect some problems with the archive. However, an
archive replica that stays close to the production cluster detects
problems with the archive much faster than a daily backup test does.
Also, the archive replica has to consume all WAL from the archive, while
a backup restore is likely to read only a portion of the archive to
recover to a certain point in time.
In that sense, there is overlap between functionality of archive and
delayed replicas and the restore testing. Together it gives us high
confidence in our cold backup and PITR recovery strategy.
* Production host: `postgres-dr-archive-01-db-gprd.c.gitlab-production.internal`
* Chef role: `gprd-base-db-postgres-archive`
## Exceptions
Exceptions to this procedure will be tracked as per the [Information Security Policy Exception Management Process](/handbook/security/controlled-document-procedure/#exceptions).
1. The definition of abuse can be found on the [security abuse operations section of the handbook](/handbook/security/)
1. In the event of an incident affecting GitLab.com availability, the SRE team may take actions immediately to keep the system available. However, the team must also immediately involve our security abuse team. A new [security on call rotation](/handbook/security/security-operations/sirt/engaging-security-on-call/) has been established in PagerDuty - There is a Security Responder rotation which can be alerted along with a Security Manager rotation.
## Backup and Restore
### Purpose
This section is part of a [controlled document](/handbook/security/controlled-document-procedure/) covering our controls for backups. It covers BCD-11 in [the controls](/handbook/security/security-assurance/security-compliance/guidance/business-continuity-and-disaster-recovery/).
### Scope
Production database backups
### Roles & Responsibilities
| Role | Responsibility |
|-----------|-----------|
| Infrastructure Team | Responsible for configuration and management |
| Infrastructure Management (Code Owners) | Responsible for approving significant changes and exceptions to this procedure |
### Procedure
Backups of our production databases are taken every 24 hours, with continuous incremental data streamed into [GCS](https://cloud.google.com/storage) at 60-second intervals. These backups are encrypted and follow this lifecycle:
- Initial 7 days in [Multi-regional](https://cloud.google.com/storage/docs/storage-classes#standard) storage class.
- After 7 days migrated to [Coldline](https://cloud.google.com/storage/docs/storage-classes#coldline) storage class.
- After 90 days, backups are deleted.
- Snapshots of non-Patroni-managed database filesystems (e.g. PostgreSQL DR replicas) and non-database filesystems (e.g. Gitaly, Redis, Prometheus) are taken every hour and kept for at least 7 days.
- Snapshots of Patroni-managed databases (taken on a designated replica) are taken every 6 hours and kept for 7 days.
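The GCS lifecycle above can be expressed as a bucket lifecycle configuration; a sketch (bucket-agnostic, applied with `gsutil lifecycle set <file> gs://<bucket>`):

```json
{
  "rule": [
    {
      "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
      "condition": {"age": 7}
    },
    {
      "action": {"type": "Delete"},
      "condition": {"age": 90}
    }
  ]
}
```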
Data stored in Object Storage (GCS) such as artifacts, the container registry, and others have no additional backups, relying on the [99.999999999% annual durability](https://cloud.google.com/storage/docs/storage-classes#descriptions) and multi-region buckets.
For details see the runbooks, particularly for [GCP snapshots](https://gitlab.com/gitlab-com/runbooks/blob/master/docs/uncategorized/gcp-snapshots.md) and [Database backups using WAL-E/WAL-G (encrypted)](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/postgresql-backups-wale-walg.md).
### Exceptions
Exceptions to this backup policy will be tracked in the [compliance issue tracker](https://gitlab.com/gitlab-com/gl-security/security-assurance/team-commercial-compliance/compliance/-/issues/).