Add MirroredStorage to keep data consistent between multiple storages
Before raising this MR, consider whether the following are required, and complete if so:
- Unit tests
- Metrics
- Documentation update(s)
If not required, please explain in brief why not.
Description
This MR adds a MirroredStorage CAS backend which, given a list of storages, keeps their data in sync. All writes go to every storage, and reads and FindMissingBlobs calls query each backend. If a blob is missing from a subset of the storages, it is downloaded from a storage that has it and uploaded to the storages that lack it.
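The mirroring behaviour described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this MR: `InMemoryStorage`, `MirroredStorageSketch`, and their method names are hypothetical stand-ins, and digests are simplified to plain strings.

```python
from typing import Dict, List, Optional, Set


class InMemoryStorage:
    """Hypothetical stand-in for a real CAS backend (assumption, not the MR's API)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def get_blob(self, digest: str) -> Optional[bytes]:
        return self.blobs.get(digest)

    def put_blob(self, digest: str, data: bytes) -> None:
        self.blobs[digest] = data


class MirroredStorageSketch:
    """Illustrative mirror over several backends; names are assumptions."""

    def __init__(self, storages: List[InMemoryStorage]) -> None:
        if not storages:
            raise ValueError("at least one storage is required")
        self.storages = storages

    def put_blob(self, digest: str, data: bytes) -> None:
        # Writes go to every backend.
        for storage in self.storages:
            storage.put_blob(digest, data)

    def get_blob(self, digest: str) -> Optional[bytes]:
        # Query every backend, remembering which ones lack the blob.
        data: Optional[bytes] = None
        lacking: List[InMemoryStorage] = []
        for storage in self.storages:
            blob = storage.get_blob(digest)
            if blob is None:
                lacking.append(storage)
            elif data is None:
                data = blob
        # Repair: copy the blob into the backends that were missing it.
        if data is not None:
            for storage in lacking:
                storage.put_blob(digest, data)
        return data

    def missing_blobs(self, digests: List[str]) -> Set[str]:
        # A blob is reported missing only if no backend has it; blobs held
        # by some backends are repaired as a side effect of get_blob.
        return {d for d in digests if self.get_blob(d) is None}
```

With two backends, deleting a blob from one and then reading it through the mirror restores it in the backend that lost it, while a blob absent from every backend is still reported as missing.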
To make the FindMissingBlobs code easier to follow, I added a HashableDigest helper dataclass, which I think is cleaner than a dict or set of strings keyed on the hash.
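As a rough illustration of why a hashable digest helps here, the sketch below uses a frozen dataclass so digests can be placed directly in sets. The field names `hash` and `size_bytes` are assumed to mirror the REAPI Digest message, and the helper functions `missing_everywhere` and `partially_missing` are hypothetical, not part of this MR.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class HashableDigest:
    # frozen=True (with the default eq=True) makes instances hashable,
    # so they can be used as set members or dict keys.
    hash: str
    size_bytes: int


def missing_everywhere(
    requested: Iterable[HashableDigest],
    missing_per_storage: List[Set[HashableDigest]],
) -> Set[HashableDigest]:
    """Digests that no storage has: the intersection of each storage's missing set."""
    still_missing = set(requested)
    for missing in missing_per_storage:
        still_missing &= missing
    return still_missing


def partially_missing(
    requested: Iterable[HashableDigest],
    missing_per_storage: List[Set[HashableDigest]],
) -> Set[HashableDigest]:
    """Digests missing from some storages but not all; these need mirroring."""
    missing_somewhere: Set[HashableDigest] = set()
    for missing in missing_per_storage:
        missing_somewhere |= missing
    missing_somewhere &= set(requested)
    return missing_somewhere - missing_everywhere(requested, missing_per_storage)
```

Plain set operations then separate the blobs to report as missing from the blobs that only need to be copied between storages.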
Changes proposed in this merge request:
- Add HashableDigest dataclass
- Add MirroredStorage CAS backend
Validation
MirroredStorage has been added to the standard storage tests, along with some dedicated tests that verify mirroring works across multiple storages. This can also be validated manually using a config with N disk storages, which makes it easy to delete some or all of the blobs and verify that the data is mirrored properly.