Commit ccf9b967 authored by Rafael Henchen

Implement GCS bucket policy changes: 14-day retention in default storage, remove coldline storage

parent 7b7d20ad
+2 −2
@@ -52,10 +52,10 @@ GitLab defines:

| Area                | Details                                                                                                           |
| ------------------- | ----------------------------------------------------------------------------------------------------------------- |
| Backup frequency    | A full backup is taken every 24 hours, with incremental updates every 60 seconds                                  |
| Backup frequency    | A full backup is taken every hour, with continuous transaction log archival.                                      |
| Storage             | Stored in [GCS](https://cloud.google.com/storage)                                                                 |
| Encryption          | Backup data is encrypted in transit and at rest                                                                   |
| Retention           | 90 days (7 days for CustomersDot database)                                                                        |
| Retention           | 14 days (7 days for CustomersDot database)                                                                        |
| Loss prevention     | [Soft Delete](https://cloud.google.com/storage/docs/soft-delete) enabled, with 7-day retention                    |
| Location/redundancy | [Multi-region geo redundancy](https://cloud.google.com/storage/docs/availability-durability)                      |
| Monitoring          | All databases are continuously monitored to ensure successful backups, with alerts triggered for missing backups. |
+11 −12
@@ -30,17 +30,16 @@ In backup and recovery, there are two SLOs:
| SLO           | Current level | Definition |
| ------------- |:-------------:| -----:|
| `DB-DR-TTR`   | 8 hours       | Maximum time to recover from a full database backup in case of disaster |
| `DB-DR-RETENTION-MULTIREGIONAL`  | 7 days       | The number of days we keep backups for recovery purposes in [Multi-regional](https://cloud.google.com/storage/docs/storage-classes#standard) Storage class in GCS. |
| `DB-DR-RETENTION-COLDLINE`  | From 8 to 90 days       | The number of days we keep backups for recovery purposes in [Coldline](https://cloud.google.com/storage/docs/storage-classes#coldline) storage class in GCS. |
| `DB-DR-RETENTION` | 14 days | The number of days we keep backups for recovery purposes in [Multi-regional](https://cloud.google.com/storage/docs/storage-classes#standard) Storage class in GCS. |

The backup strategy is to take a daily snapshot of the full database
(basebackup) and store this in Google Cloud Storage. Additionally, we capture the
write-ahead log data in GCS to be able to perform point-in-time recovery
(PITR) using one of the basebackups. [Read more on Disaster Recovery](/handbook/engineering/gitlab-com/policies/disaster-recovery/)
The primary backup strategy is to take hourly incremental disk snapshots (block level) of all our database clusters (these are 
[multi-regional standard persistent disk snapshots](https://docs.cloud.google.com/compute/docs/disks/snapshots)).
We also implement a secondary backup strategy with weekly full backups of database files (database level) and daily incremental 
backups stored on separate multi-region Google Cloud Storage buckets. Additionally, we continuously archive all 
write-ahead (transaction) log data in GCS to enable point-in-time recovery (PITR) using any backup strategy 
(block-level or database-level). [Read more on Disaster Recovery](/handbook/engineering/gitlab-com/policies/disaster-recovery/)

For `DB-DR-TTR` we need to consider worst-case scenarios with the
latest backup being 24 hours old. Hence recovery time includes the time
it takes to perform PITR to recover from archive to a certain point in
Recovery time includes the time to perform PITR from the baseline backup plus transaction log archive recovery up to a certain point in
time (right before the disaster).
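
As a rough illustration of the recovery-time reasoning above, the following Python sketch adds the time to restore the baseline backup to the time needed to replay the archived transaction log up to the target point. The function name, the restore duration, and the replay speed-up factor are hypothetical placeholders, not measured values from GitLab's infrastructure; only the 8-hour `DB-DR-TTR` level comes from the SLO table.

```python
from datetime import datetime, timedelta

# DB-DR-TTR level taken from the SLO table.
TTR_SLO = timedelta(hours=8)


def estimate_recovery_time(backup_time, target_time, restore_duration, wal_replay_speedup):
    """Estimate total recovery time for a point-in-time recovery.

    restore_duration: assumed time to restore the baseline backup.
    wal_replay_speedup: hypothetical ratio between the wall-clock span
        covered by the transaction log and the time needed to replay it.
    """
    if target_time < backup_time:
        raise ValueError("target time precedes the baseline backup")
    if wal_replay_speedup <= 0:
        raise ValueError("replay speed-up must be positive")
    # Span of transaction log that must be replayed after the restore.
    wal_span = target_time - backup_time
    return restore_duration + wal_span / wal_replay_speedup


if __name__ == "__main__":
    backup = datetime(2024, 1, 15, 12, 0)
    target = datetime(2024, 1, 15, 12, 50)  # right before the disaster
    estimate = estimate_recovery_time(backup, target, timedelta(hours=2), 5.0)
    print(estimate, "within SLO:", estimate <= TTR_SLO)
```

With hourly baselines the log span to replay is at most about one hour, so under these assumptions the restore of the baseline itself is likely to dominate the estimate.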

We are able to recover to any point in time within the last `DB-DR-RETENTION` days.
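
The retention statement can be sketched as a small check: a target time is recoverable only if it lies within the last `DB-DR-RETENTION` days, and PITR then starts from the most recent baseline taken at or before that target. This is a minimal sketch, not the actual tooling; `pick_baseline` and the list of backup timestamps are hypothetical, and the list is assumed to contain only baselines that are still retained.

```python
from datetime import datetime, timedelta

# DB-DR-RETENTION level taken from the SLO table.
RETENTION = timedelta(days=14)


def pick_baseline(backups, target, now):
    """Return the latest baseline backup taken at or before `target`.

    backups: iterable of datetimes for retained baselines (assumed input).
    Raises ValueError if `target` is in the future or outside the
    retention window, LookupError if no baseline precedes it.
    """
    if target > now:
        raise ValueError("target time is in the future")
    if now - target > RETENTION:
        raise ValueError("target time is outside the retention window")
    candidates = [b for b in backups if b <= target]
    if not candidates:
        raise LookupError("no baseline backup precedes the target time")
    return max(candidates)


if __name__ == "__main__":
    now = datetime(2024, 1, 15, 12, 0)
    hourly = [datetime(2024, 1, 15, h, 0) for h in (9, 10, 11)]
    print(pick_baseline(hourly, datetime(2024, 1, 15, 10, 30), now))
```

The transaction log archived after the chosen baseline is then replayed up to the target time.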