Problem to solve
Common data integrity problems with uploads:
- The file doesn't exist
- The file changed (checksum mismatch)
- The associated model does not exist (e.g. the project record was deleted)
- Geo only: The file doesn't exist on a secondary
- Geo only: The file changed on a secondary
These problems can arise due to bugs or transient infrastructure problems. The sketch below shows how a single upload could be checked for the first three.
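For concreteness, a minimal sketch of checking one upload for the first three problems, in plain Ruby. The `Upload` attributes used here (`absolute_path`, `model`, `checksum`) are assumptions for illustration, not the actual GitLab schema:

```ruby
require 'digest'

# Check one upload for the first three problems above.
# Returns :ok or a symbol naming the failure.
def verify_upload(upload)
  # The file doesn't exist
  return :missing_file unless File.exist?(upload.absolute_path)

  # The associated model does not exist (orphaned upload)
  return :orphaned_upload if upload.model.nil?

  # The file changed (checksum mismatch)
  actual = Digest::SHA256.file(upload.absolute_path).hexdigest
  return :checksum_mismatch unless actual == upload.checksum

  :ok
end
```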
Use case: Geo DR
This is especially relevant to Geo Disaster Recovery. Leading up to the GCP migration, I spent many hours verifying, fixing, and reverifying uploads. We fixed a lot of bugs in the process, and GitLab.com has switched to hashed storage for uploads, but sysadmins still need assurance of data integrity, or at least an indication of which data needs to be fixed, in order to trust a DR solution.
- Add service to verify Uploads on secondaries
- Add service to checksum Uploads on primary (both services are sketched after this list)
- Automatically verify Uploads on secondaries
- Ensure Upload checksum on primary
- Track primary Upload checksum counts
- Track secondary Upload verification counts
- When a primary Upload is checksummed, reset verification on secondaries
- Regularly reverify Uploads on primary
- Let admins adjust file verification rate
- Allow verification of files in Object Storage
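As a rough sketch of how the two services and the reset-on-checksum behavior might fit together, assuming a Rails/ActiveRecord context and a hypothetical `UploadRegistry` table on secondaries with `verified_at` and `verification_failure` columns (the class and column names are illustrative, not the actual GitLab Geo implementation):

```ruby
require 'digest'

# Hypothetical primary-side service: computes and stores the checksum
# for an Upload that doesn't have one yet.
class UploadChecksumService
  def initialize(upload)
    @upload = upload
  end

  def execute
    @upload.update!(checksum: Digest::SHA256.file(@upload.absolute_path).hexdigest)

    # When a primary Upload is checksummed, reset verification on
    # secondaries so they re-verify against the new checksum.
    UploadRegistry.where(upload_id: @upload.id)
                  .update_all(verified_at: nil, verification_failure: nil)
  end
end

# Hypothetical secondary-side service: compares the replicated file's
# checksum against the checksum computed on the primary.
class UploadVerificationService
  def initialize(registry)
    @registry = registry
  end

  def execute
    upload = @registry.upload
    local  = Digest::SHA256.file(upload.absolute_path).hexdigest

    if local == upload.checksum
      @registry.update!(verified_at: Time.current, verification_failure: nil)
    else
      @registry.update!(verified_at: nil, verification_failure: 'checksum mismatch')
    end
  end
end
```

Regular reverification and the admin-adjustable rate could then be a periodic worker that re-runs these services over a configurable number of records per run, and Object Storage support would swap the `Digest::SHA256.file` call for a streamed read from the remote store.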
We should track the number of uploads that have passed vs. failed verification. We could also group failures by known type, but that doesn't seem necessary in the first iteration; most of the value is in having charts of passed vs. failed.
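Assuming the hypothetical `UploadRegistry` from the sketch above, the two counts could be simple queries:

```ruby
# Passed vs. failed verification, suitable for plotting over time.
verified_count = UploadRegistry.where.not(verified_at: nil).count
failed_count   = UploadRegistry.where.not(verification_failure: nil).count
```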
What does success look like, and how can we measure that?
- Number of uploads that passed verification
- Number of uploads that failed verification (see the metrics sketch below)
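One way these counts could be exposed for charting, sketched with the prometheus-client gem; the metric names are made up for illustration:

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

verified_gauge = Prometheus::Client::Gauge.new(
  :geo_uploads_verified,
  docstring: 'Uploads that passed verification on this secondary'
)
failed_gauge = Prometheus::Client::Gauge.new(
  :geo_uploads_verification_failed,
  docstring: 'Uploads that failed verification on this secondary'
)
registry.register(verified_gauge)
registry.register(failed_gauge)

# Values would come from the counting queries above.
verified_gauge.set(verified_count)
failed_gauge.set(failed_count)
```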
- Should the non-Geo-specific work go into CE? Answer: not right now (https://gitlab.com/gitlab-org/gitlab-ee/issues/7184#note_120279789)