
Add MirroredStorage to keep data consistent between multiple storages

Jeremiah Bonney requested to merge jbonney/mirrored-storage into master

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests
  • Metrics
  • Documentation update(s)

If not required, please explain in brief why not.

Description

This MR adds a MirroredStorage CAS backend which, given a list of storages, keeps their data in sync. All writes go to every storage, and reads and FindMissingBlobs calls query each backend. If a blob is missing from a subset of the storages, it is downloaded from a storage that has it and uploaded to the storages that lack it.
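A rough sketch of the mirroring behaviour described above, not the code in this MR. The `InMemoryStorage` and `MirroredSketch` classes, their method names, and the use of plain string keys in place of digests are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set


class InMemoryStorage:
    """Hypothetical stand-in for a single CAS backend, keyed by digest hash."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self.blobs.get(key)

    def put(self, key: str, data: bytes) -> None:
        self.blobs[key] = data

    def find_missing(self, keys: Iterable[str]) -> Set[str]:
        return {k for k in keys if k not in self.blobs}


class MirroredSketch:
    """Hypothetical mirror over several storages (assumes at least one)."""

    def __init__(self, storages: List[InMemoryStorage]) -> None:
        if not storages:
            raise ValueError("at least one storage is required")
        self.storages = storages

    def put(self, key: str, data: bytes) -> None:
        # Writes go to every storage.
        for storage in self.storages:
            storage.put(key, data)

    def _first_copy(self, key: str) -> Optional[bytes]:
        # Return the blob from the first storage that has it, if any.
        for storage in self.storages:
            data = storage.get(key)
            if data is not None:
                return data
        return None

    def _repair(self, key: str, data: bytes) -> None:
        # Upload the blob to every storage that lacks it.
        for storage in self.storages:
            if storage.get(key) is None:
                storage.put(key, data)

    def get(self, key: str) -> Optional[bytes]:
        data = self._first_copy(key)
        if data is not None:
            self._repair(key, data)
        return data

    def find_missing(self, keys: Iterable[str]) -> Set[str]:
        wanted = list(keys)
        per_storage = [s.find_missing(wanted) for s in self.storages]
        # A blob is truly missing only if no storage has it.
        missing_everywhere = set.intersection(*per_storage)
        # Blobs missing from only some storages are copied across.
        for key in set.union(*per_storage) - missing_everywhere:
            data = self._first_copy(key)
            if data is not None:
                self._repair(key, data)
        return missing_everywhere


# Usage: write through the mirror, drop one copy, then let the mirror heal it.
first, second = InMemoryStorage(), InMemoryStorage()
mirror = MirroredSketch([first, second])
mirror.put("abc", b"hello")
del first.blobs["abc"]
still_missing = mirror.find_missing(["abc", "zzz"])
```

In this sketch only blobs absent from every storage are reported as missing; anything present in at least one storage is repaired as a side effect of the query.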

To make the FindMissingBlobs code easier to follow, I added a HashableDigest helper dataclass, which I think is cleaner than a dict or set of strings corresponding to the hashes.
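For illustration, a hashable digest helper could look roughly like the following. This is a sketch rather than the class added in this MR: the field names `hash` and `size_bytes` are assumed to mirror the fields of a CAS digest, and the values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashableDigest:
    """Assumed shape: frozen, so instances get __eq__ and __hash__ for free."""

    hash: str
    size_bytes: int


# Digests can now be stored in sets and compared by value.
missing_in_first = {HashableDigest("abc", 3)}
missing_in_second = {HashableDigest("abc", 3), HashableDigest("def", 5)}

# Set algebra replaces manual bookkeeping over hash strings.
missing_everywhere = missing_in_first & missing_in_second
missing_somewhere = missing_in_first | missing_in_second
```

Using a frozen dataclass means the per-storage results of a missing-blob query can be combined with ordinary set operations.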

Changes proposed in this merge request:

  • Add HashableDigest dataclass
  • Add MirroredStorage CAS backend

Validation

MirroredStorage was added to the standard storage tests, along with some dedicated tests verifying that mirroring works across multiple storages. It can also be validated manually using a config with N disk storages, which makes it easy to delete some or all of the blobs and verify that the data is mirrored properly.
