Geo: Improving Repository Checksum
Description
While discussing our checksums with @nick.thomas, he pointed out that they do not cover every part of the repository. In fact, we exclude all refs/merge-request and refs/keep-around refs from our checks:
https://gitlab.com/gitlab-org/gitlab-ee/blob/26ed2d2aa0f22fba27c1c7f991aae1b12e2aaa78/ee/lib/gitlab/git/checksum.rb#L34
We need to improve that while keeping it fast enough to be useful.
Proposal
I have suggested before that we may want to have a "fast" and a "slow" checksum algorithm. The fast one would work more like a heartbeat that says either "looks fine" or "hey, completely broken", and the slow one would be used to make sure the repository is 100% correct (i.e. we are not missing any useful data and nothing is corrupt).
We need that second assurance before people can trust it as a DR / backup solution.
In today's code we are shelling out from Sidekiq / Rails, but it would also make sense for this to be part of gitaly and/or an external tool.
Let's say we make a git-checksum binary that we can run against any bare repository, and it gives us back a hash we can compare with the hash of another repository.
This tool would also give us a --fast option, and a --full option to do a complete check. Maybe we want to provide not one but several hashes with a --detailed option (which would be hashed together to form the --full result).
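As a rough illustration of how the --detailed hashes could be hashed together to form the --full value, here is a minimal sketch. The function name `full_checksum`, the use of SHA-256, and the way names and values are serialized are all assumptions made for this example, not something already decided:

```python
import hashlib


def full_checksum(detailed):
    """Combine per-signal hex hashes into one value (hypothetical helper).

    `detailed` maps a signal name to its hex string. Signals are fed to the
    hash in sorted name order so the result does not depend on dict order.
    """
    h = hashlib.sha256()  # assumption: any stable hash function would do
    for name in sorted(detailed):
        h.update(name.encode())
        h.update(b"\0")  # separator between name and value
        h.update(detailed[name].encode())
        h.update(b"\n")  # separator between signals
    return h.hexdigest()


# Two repositories match on --full only if every --detailed signal matches.
primary = {"branch_oid_xor": "11" * 20, "tag_oid_xor": "22" * 20}
secondary = {"tag_oid_xor": "22" * 20, "branch_oid_xor": "11" * 20}
same = full_checksum(primary) == full_checksum(secondary)
```

Keeping the individual signals around in --detailed would let us report which part of the repository differs, instead of only that something differs.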
Here are some ideas:
- count refs + tags
- XOR all tag hashes into a single one
- XOR all branch hashes into a single one
- hash all tag names then XOR them into a single one
- hash all branch names then XOR them into a single one
- hash the content of each of these important files, then XOR the hashes into a single one: FETCH_HEAD, HEAD, config, description
- count all nonstandard refs
- XOR all nonstandard ref hashes into a single one
- hash all nonstandard ref names and XOR them into a single one
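The ref-based ideas above could be sketched roughly as follows. This assumes refs arrive as a mapping of ref name to a 40-character hex SHA-1 object ID; the names `xor_digests` and `ref_signals` are hypothetical and only serve the illustration:

```python
import hashlib


def xor_digests(digests, size=20):
    """XOR equal-length byte strings; an empty input yields `size` zero bytes."""
    acc = bytes(size)
    for d in digests:
        acc = bytes(a ^ b for a, b in zip(acc, d))
    return acc


def ref_signals(refs):
    """Compute count, XOR of object IDs, and XOR of hashed names for a ref group.

    `refs` maps a full ref name to a 40-char hex object ID (assumed SHA-1).
    """
    oids = [bytes.fromhex(oid) for oid in refs.values()]
    names = [hashlib.sha1(name.encode()).digest() for name in refs]
    return {
        "count": len(refs),
        "oid_xor": xor_digests(oids).hex(),
        "name_xor": xor_digests(names).hex(),
    }


branches = {"refs/heads/master": "aa" * 20, "refs/heads/dev": "bb" * 20}
signals = ref_signals(branches)
```

XOR is order-independent, so the refs do not need to be sorted before combining them. One caveat with keeping the object-ID XOR and the name XOR separate: if two branches swap their targets, both values stay the same, so a stricter variant would hash each name together with its object ID before XORing.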
Some of these "signals" are fast to compute and some can be very slow. If we only compare the counts and they don't match (which is super fast to do), we know we are missing data. On a slightly slower level, if we check only the existing branches and see that some are missing, we still detect the failure quickly and can initiate a repair action. Moving to slower methods, also checking the hashes of existing branches covers more. Then we move to data that is accessed less often, like keep-around refs and merge-request refs, but that is still important in terms of not losing data.
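A minimal sketch of that fast-to-slow escalation, assuming each signal is exposed as a zero-argument callable so that slow signals are only computed when all faster ones already match. The function name `first_mismatch` and the signal names are hypothetical:

```python
def first_mismatch(local, remote, tiers):
    """Return the name of the first differing signal, or None if all match.

    `local` and `remote` map a signal name to a zero-argument callable that
    computes it. `tiers` lists signal names from cheapest to most expensive;
    evaluation stops at the first mismatch, so later signals are never computed.
    """
    for name in tiers:
        if local[name]() != remote[name]():
            return name
    return None


# Cheapest checks first: counts, then branch hashes, then nonstandard refs.
tiers = ["ref_count", "branch_oid_xor", "nonstandard_oid_xor"]
primary = {
    "ref_count": lambda: 3,
    "branch_oid_xor": lambda: "11" * 20,
    "nonstandard_oid_xor": lambda: "22" * 20,
}
secondary = {
    "ref_count": lambda: 2,  # a ref is missing on the secondary
    "branch_oid_xor": lambda: "11" * 20,
    "nonstandard_oid_xor": lambda: "22" * 20,
}
failed_at = first_mismatch(primary, secondary, tiers)
```

Knowing which tier failed also hints at the kind of repair action to initiate, for example re-fetching branches versus re-syncing merge-request refs.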
By making it a standalone command, we can share it with the community, get some love for doing that, and probably receive contributions to a critical part of our code/strategy. If we also trigger this from within gitaly, we can remove another remote filesystem dependency we have today (it will be faster to run it locally than on an NFS-mounted disk).
Links / references
cc @nick.thomas @dbalexandre @digitalmoksha @stanhu @jramsay