Geo: Improving Repository Checksum

Description

When discussing our checksums with @nick.thomas, he pointed out that we are not covering every part of the repository. In fact, we exclude all refs/merge-requests and refs/keep-around refs from our checks: https://gitlab.com/gitlab-org/gitlab-ee/blob/26ed2d2aa0f22fba27c1c7f991aae1b12e2aaa78/ee/lib/gitlab/git/checksum.rb#L34

We need to improve that while keeping it fast enough to be useful.

Proposal

I have suggested before that we may want both a "fast" and a "slow" checksum algorithm. The fast one would act more like a heartbeat, answering either "looks fine" or "completely broken", while the slow one would verify that the repository is 100% correct (i.e. we are not missing any useful data and the repository is not corrupt).

We need the second assurance so that people can trust Geo as a DR / backup solution.

In today's code we shell out from Sidekiq / Rails, but this also makes sense as part of Gitaly and/or as an external tool.

Let's say we build a git-checksum binary that we can run against any bare repository and that gives us back a hash we can compare with another repository's. The tool would offer a --fast option as well as a --full option for a complete check. Maybe we even want to provide not one but several hashes via a --detailed option (which would then be hashed together to form the --full result).
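
To make the --detailed / --full relationship concrete, here is a minimal sketch of how the per-signal hashes from a --detailed run could be folded into the single --full hash. All names here (`full_checksum`, the signal keys) are assumptions for illustration, not an existing interface:

```ruby
require 'digest'

# Hypothetical: combine the per-signal hashes produced by a `--detailed`
# run into the single hash reported by `--full`. Sorting by signal name
# keeps the result deterministic regardless of how the signals were
# collected.
def full_checksum(detailed_hashes)
  payload = detailed_hashes.sort.map { |name, hex| "#{name}:#{hex}" }.join("\n")
  Digest::SHA1.hexdigest(payload)
end

detailed = {
  'branch_hashes' => 'a' * 40, # placeholder signal values
  'tag_hashes'    => 'b' * 40,
}
full_checksum(detailed) # deterministic 40-char hex string
```

The design choice here is that --full never has to be computed separately: it is just a digest over whatever --detailed already produced.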

Here are some ideas:

  • count refs + tags
  • XOR all tag hashes into a single one
  • XOR all branch hashes into a single one
  • hash all tag names, then XOR them into a single one
  • hash all branch names, then XOR them into a single one
  • hash the content of each of these important files in sequence, then XOR into a single one: FETCH_HEAD, HEAD, config, description
  • count all nonstandard refs
  • XOR all nonstandard ref hashes into a single one
  • hash all nonstandard ref names, then XOR them into a single one
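
The XOR-based signals above can be sketched in a few lines. This is a minimal illustration, assuming refs are given as 40-character hex SHA-1 strings; the helper names are made up for this example:

```ruby
require 'digest'

# "XOR all ... hashes into a single one": fold 40-char hex strings
# together with XOR. XOR is commutative and associative, so the result
# does not depend on the order in which refs are enumerated.
def xor_hexes(hexes)
  hexes.reduce('0' * 40) do |acc, hex|
    (acc.to_i(16) ^ hex.to_i(16)).to_s(16).rjust(40, '0')
  end
end

# "hash all branch names then XOR them into a single one": hash each
# name first, then fold the digests together.
def xor_name_hashes(names)
  xor_hexes(names.map { |name| Digest::SHA1.hexdigest(name) })
end
```

Order independence matters here: two repositories that list the same refs in different orders still produce identical signals.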

Some of these "signals" are fast to compute and some can be very slow. If we just compare the numeric counts and they don't match (which is super fast to do), we know we are missing data. At a slightly slower level, checking only the existing branches lets us detect that some are missing and initiate a repair action earlier. Moving to slower methods, also checking the hashes of existing branches covers more ground. Then we move to data that is less frequently accessed, like keep-around and merge-request refs, which is still important in terms of not losing data.
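
The escalation above could look something like this: compare the cheapest signals first and only fall through to the expensive ones when everything so far agrees. The signal names are hypothetical, and `summary` is assumed to be a precomputed hash of signal name to value for one repository:

```ruby
# Sketch of the tiered comparison: signals are ordered cheapest-first,
# and we report the first one that diverges (or nil if the repositories
# agree on every signal checked).
CHECK_ORDER = %i[ref_count branch_name_xor branch_hash_xor keep_around_xor].freeze

def first_divergence(primary, secondary)
  CHECK_ORDER.each do |signal|
    return signal if primary[signal] != secondary[signal]
  end
  nil
end
```

A heartbeat-style --fast run would only consult the head of that list; --full walks it to the end.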

By making it a standalone command, we can share it with the community, get some love for doing that, and probably receive contributions to a critical part of our code/strategy. If we also trigger this from within Gitaly, we can remove another remote-filesystem dependency we have today (it will be faster to run locally than over an NFS-mounted disk).

Links / references

cc @nick.thomas @dbalexandre @digitalmoksha @stanhu @jramsay
