Make it easier to start using Geo and Backup integrity
@sranasinghe and I discussed this today when we met:
Looking at the volume of support time we spend, my gut feeling was that a good portion of it (~80%) is related to issues setting up Geo.
Either the documentation is not followed correctly, or the customer makes mistakes with the multiple configuration files they have to change to get Geo working.
This was another reminder that the lack of a proper centralized configuration system makes it very easy to make these mistakes.
In the past I've advocated for us + Distribution to build some layer on top of Consul to handle that, and if we already had it we could likely cut most of that 80%.
Another angle @sranasinghe mentioned is that after you correctly configure Geo (considering a larger installation), it takes time for it to reach a green state.
A good portion of that time is spent just running regular verification for the first time.
A good strategy here would be to think of ways to make verification part of the GitLab Premium offering, not necessarily tied to Geo.
Based on some discussion around how to improve Backups, we've identified a gap in our offering regarding how you can trust/validate that the backup archive you have, actually works (or doesn't have any corrupted data in it).
A good DR strategy for backups requires you to try restoring from a backup, as a test/validation that your backup can actually restore your data in case you ever need it.
Because that is a very expensive operation, you may want to do it only a couple of times a year. For a more regular check, what you want instead is a way to verify that a backup archive is not corrupted.
There are many layers to that:
- The simplest one is to make sure you can decompress the compressed data.
- A more expensive version is to validate that what is stored matches a known checksum, either on a sample or against every single bit of data.
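To make the two layers concrete, here is a minimal Ruby sketch, assuming a gzip-compressed archive and a SHA256 checksum (the function names are illustrative, not an existing GitLab API):

```ruby
require "zlib"
require "digest"

# Layer 1: cheapest check -- can we read the compressed stream all the way
# to the end? Zlib raises if the gzip data is truncated or corrupted.
def decompressible?(path)
  Zlib::GzipReader.open(path) do |gz|
    nil while gz.read(1 << 20) # stream in 1 MiB chunks until EOF
  end
  true
rescue Zlib::Error
  false
end

# Layer 2: more expensive -- compare the stored bytes against a known checksum.
def checksum_matches?(path, expected_sha256)
  Digest::SHA256.file(path).hexdigest == expected_sha256
end
```

The same shape scales up: run layer 1 on every archive, and layer 2 on a sample (or everything, if you can afford it).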
Tying this idea back to verification: if we make verification part of the regular GitLab Premium feature set, we could provide "Verifiable Backups" (or something with a similar name/connotation) that would include checksum files along with each archive, and those files would rely on the verification mechanism.
Having that verification mechanism running for everything by default would make adopting Geo a smaller step from there.
For verification to work as intended (to identify when data is corrupted), we need to change how we do it slightly:
- We still create a verification hash at the time a blob is persisted
- We store the subsequent verification attempts in a separate field
If they ever diverge, it means the data was modified/corrupted. Because we store immutable data, that should never happen.
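A minimal sketch of that divergence check, with a hypothetical `VerificationState` record standing in for wherever the two fields would actually live:

```ruby
require "digest"

# Hypothetical record for a stored blob: the checksum computed when the blob
# was first persisted, and the result of the most recent re-verification.
VerificationState = Struct.new(:initial_checksum, :last_checksum, keyword_init: true) do
  # Blobs are immutable, so any divergence between the original checksum
  # and a later one means the stored data was modified or corrupted.
  def corrupted?
    !last_checksum.nil? && last_checksum != initial_checksum
  end
end

def compute_checksum(data)
  Digest::SHA256.hexdigest(data)
end

# Re-read the stored bytes and record the new checksum in the separate field.
def reverify(state, stored_data)
  state.last_checksum = compute_checksum(stored_data)
  state
end
```

The key design point is that `initial_checksum` is written once at persist time and never updated, so it stays a trustworthy baseline for every later attempt.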
When that situation is identified, the user has two options to restore the file:
- Restore from a backup (we can validate that restore because the backup archive will also include the hashes, which we can recompute and validate against what is stored)
- Restore from a Geo secondary (we don't do that today, but it could be done)
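Both restore paths reduce to the same check: recompute the candidate copy's hash and compare it against the hash recorded at persist time. A sketch, with illustrative names rather than an actual GitLab API:

```ruby
require "digest"

# Is this candidate copy (from a backup archive or a Geo secondary)
# safe to restore from?
def safe_restore_source?(candidate_bytes, recorded_sha256)
  Digest::SHA256.hexdigest(candidate_bytes) == recorded_sha256
end

# Given several candidate copies, pick the first one whose bytes still
# match the checksum recorded when the blob was persisted.
def pick_restore_source(candidates, recorded_sha256)
  candidates.find { |_name, bytes| safe_restore_source?(bytes, recorded_sha256) }&.first
end
```

For example, if the local copy and the backup are both corrupted but a Geo secondary still matches, the secondary is a valid restore source.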
Having verification decoupled from the Geo offering and tied to Backups means more people will use it, which means more adoption and internal pressure to support it, which feeds back into Geo again down the pipeline.