Proposal: Use sync/rsync commands to restore backups
Proposal
During backup, the corresponding sync/rsync command is used depending on the storage provider (s3cmd sync, aws s3 sync, or gsutil rsync). This lets us use the features each provider's CLI offers to speed up the transfer and verify its result.
s3cmd
Synchronize a directory tree to S3 (checks files freshness using size and md5 checksum, unless overridden by options, see below)
s3cmd sync LOCAL_DIR s3://BUCKET[/PREFIX] or s3://BUCKET[/PREFIX] LOCAL_DIR or s3://BUCKET[/PREFIX] s3://BUCKET[/PREFIX]
aws
aws s3 sync <LocalPath> <S3Uri> or <S3Uri> <LocalPath> or <S3Uri> <S3Uri>
gsutil
The gsutil rsync command makes the contents under dst_url the same as the
contents under src_url, by copying any missing files/objects (or those whose
data has changed), and (if the -d option is specified) deleting any extra
files/objects. src_url must specify a directory, bucket, or bucket
subdirectory. For example, to make gs://mybucket/data match the contents of
the local directory "data" you could do:
gsutil rsync -d data gs://mybucket/data
To recurse into directories use the -r option:
gsutil rsync -d -r data gs://mybucket/data
To copy only new/changed files without deleting extra files from
gs://mybucket/data leave off the -d option:
gsutil rsync -r data gs://mybucket/data
If you have a large number of objects to synchronize you might want to use the
gsutil -m option, to perform parallel (multi-threaded/multi-processing)
synchronization:
gsutil -m rsync -d -r data gs://mybucket/data
The same process can be used to restore a backup. Once the bucket is backed up in tmp-backups, instead of invoking cleanup we can call restore_from_backup directly. Rather than looping over every single file, it would rely on sync/rsync to upload the missing or changed files and remove the ones that no longer exist.
This might work in both scenarios: restoring to a freshly installed instance, or recovering a previous backup on an existing instance.
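The restore step described above could be sketched roughly as follows. This is only an illustration, not the existing restore_from_backup implementation: the function name `build_restore_command`, the tool labels ("s3cmd", "awscli", "gsutil"), and the directory and bucket names are hypothetical. It assumes the extracted backup is available in a local directory and uses each CLI's own delete option (`--delete-removed` for s3cmd, `--delete` for aws s3 sync, `-d` for gsutil rsync) to remove objects that are not part of the backup.

```python
from typing import List


def build_restore_command(tool: str, source_dir: str, bucket: str,
                          delete_extra: bool = True) -> List[str]:
    """Build the argv that makes `bucket` match the local `source_dir`.

    `tool` is one of "s3cmd", "awscli" or "gsutil" (hypothetical labels).
    The command is only built here, not executed.
    """
    # Trailing slash so s3cmd syncs the directory's contents,
    # not the directory itself.
    src = source_dir.rstrip("/") + "/"

    if tool == "s3cmd":
        cmd = ["s3cmd", "sync"]
        if delete_extra:
            cmd.append("--delete-removed")  # drop objects missing locally
        return cmd + [src, f"s3://{bucket}/"]

    if tool == "awscli":
        cmd = ["aws", "s3", "sync"]
        if delete_extra:
            cmd.append("--delete")  # drop objects missing locally
        return cmd + [src, f"s3://{bucket}"]

    if tool == "gsutil":
        # -m runs the sync in parallel, -r recurses into directories.
        cmd = ["gsutil", "-m", "rsync", "-r"]
        if delete_extra:
            cmd.append("-d")  # drop objects missing locally
        return cmd + [src, f"gs://{bucket}"]

    raise ValueError(f"unsupported tool: {tool}")
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing `delete_extra=False` gives an additive restore that only uploads missing or changed files, which may be the safer default when recovering onto an existing instance.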
Edited by Ferran Vidal