Explore idea: Enable consistent backups via delayed deletion, and low RPO via copying all new blobs to backup_upload
Problem
Consistent backups are not possible without downtime. Consistency is difficult to achieve across the 3 major classes of GitLab data: Postgres, Git, and Blobs.
If we could somehow restore live backups of all that data to a single point-in-time, then that restore would be consistent.
- Postgres data: Postgres can already be dumped at a point-in-time (PIT), and it can also be WAL archived for continuous backup and PIT restore.
- Git data: Gitaly is working on WAL architecture, which should allow PIT restore.
- Blobs:
❓ How can we achieve PIT restore?❓
Proposal
Implement delayed file deletion for all blobs.
This provides the following guarantee:
A slow copy of all blobs started at a PIT contains every blob that existed at that PIT, provided the copy finishes within the deletion delay.
As long as we properly handle any "extra" data (blobs created after the PIT that the copy also picks up), consistent backups are possible.
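The guarantee can be checked with a small model. This is a minimal sketch, not GitLab code: the function names, the tuple layout, and the hour-based timeline are all assumptions made for illustration. It treats a blob as "logically existing" while the database references it, and as "physically present" until the deletion delay has elapsed.

```python
def logically_exists(created, delete_requested, t):
    """True if the database would consider the blob to exist at time t."""
    return created <= t and (delete_requested is None or delete_requested > t)


def physically_present(created, delete_requested, delay, t):
    """True if the blob's bytes are still on storage at time t.

    With delayed deletion, bytes are only removed `delay` time units
    after the delete was requested.
    """
    if t < created:
        return False
    return delete_requested is None or t < delete_requested + delay


def copy_contains_pit(blobs, delay, pit, copy_duration):
    """Check the guarantee for a copy that starts at `pit`.

    `blobs` maps path -> (created, delete_requested or None).
    Worst case: the copier reaches each blob only at the very end of
    the copy window, so presence is tested at pit + copy_duration.
    """
    copy_end = pit + copy_duration
    return all(
        physically_present(created, deleted, delay, copy_end)
        for created, deleted in blobs.values()
        if logically_exists(created, deleted, pit)
    )


# Hypothetical timeline in hours; the backup starts at hour 10.
blobs = {
    "a": (0, None),   # never deleted
    "b": (1, 11),     # deleted one hour after the backup started
    "c": (2, 5),      # already deleted before the backup started
    "d": (12, None),  # created during the copy: "extra" data
}
fits = copy_contains_pit(blobs, delay=24, pit=10, copy_duration=20)
too_slow = copy_contains_pit(blobs, delay=24, pit=10, copy_duration=30)
```

In this model `fits` holds because blob "b" is still on storage when the 20-hour copy ends, while `too_slow` fails because a 30-hour copy outlasts the 24-hour delay and may miss "b".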
Daily backup and restore procedure
(Ignoring Gitaly)
Set up backups:
- Configure a file deletion delay, e.g. 24 hours
- Configure a daily PG backup
- Configure object storage backups to trigger at exactly the same time as the PG backup
Restore procedure:
Each backup is consistent as of the PIT at which it began.
- This works for local or remote file storage.
- This works for the most basic of file backup mechanisms: Copy.
- This works for air-gapped instances.
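One possible way to handle the "extra" data after a restore is to compare the blob paths referenced by the restored PG data with the paths present in the restored file copy. The sketch below assumes both lists can be obtained somehow; `reconcile` and the example paths are hypothetical, not an existing GitLab API.

```python
def reconcile(db_paths, restored_paths):
    """Compare DB-referenced blob paths with the restored file copy.

    Returns (missing, extra):
    - missing: referenced by the DB but absent from the copy; this
      should be empty if the copy finished within the deletion delay.
    - extra: present in the copy but unknown to the DB, e.g. blobs
      created or delay-deleted around the backup window.
    """
    db = set(db_paths)
    restored = set(restored_paths)
    return sorted(db - restored), sorted(restored - db)


# Hypothetical example: the copy picked up one blob created after the PIT.
missing, extra = reconcile(
    ["uploads/a.png", "uploads/b.png"],
    ["uploads/a.png", "uploads/b.png", "uploads/d.png"],
)
```

Here `missing` is empty and `extra` lists only the blob that the restored database does not know about, which can then be ignored or cleaned up.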
Continuous backups on AWS or similar
Configure continuous backups of PG and S3. You can restore to any PIT and it will be consistent. (I assume there is some margin of error.)
Assumptions
If any of these are not true, then either we need to make it so, or we must list it as a limitation of consistent backups.
- All blobs are immutable: GitLab never modifies blob data at a particular path.
- There is no way for a user to access a blob which is delay-deleted. E.g. if you delete an `uploads` row, then the blob becomes completely inaccessible through GitLab. We may need to double check this for models which store Carrierwave file fields on their own table (instead of the `uploads` table).
- If PG says a file path is not taken but a file exists at that path, then GitLab must act as if the file path is not taken.
- In general, the PG DB must be the SSOT for file existence etc.
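The access-related assumptions above can be illustrated with a small model in which every decision about a blob goes through the database first. This is a sketch under stated assumptions: `BlobAccess`, its fields, and the example paths are hypothetical and do not correspond to actual GitLab classes.

```python
class BlobAccess:
    """Toy model where the DB is the single source of truth for blobs."""

    def __init__(self, db_paths, storage):
        self.db_paths = set(db_paths)  # paths the DB currently references
        self.storage = dict(storage)   # path -> bytes; may hold delay-deleted blobs

    def path_taken(self, path):
        # Only the DB decides; a leftover file on storage does not count.
        return path in self.db_paths

    def read(self, path):
        # A delay-deleted blob is still on storage but has no DB row,
        # so it must not be served.
        if path not in self.db_paths:
            return None
        return self.storage.get(path)


# Hypothetical state: "old.png" was deleted in the DB but its bytes remain.
access = BlobAccess(
    db_paths=["live.png"],
    storage={"live.png": b"live", "old.png": b"old"},
)
live = access.read("live.png")
old = access.read("old.png")
old_taken = access.path_taken("old.png")
```

In this model the delay-deleted blob is unreadable and its path counts as free, even though the bytes still exist on storage.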