Proper validation to confirm that a repo was moved successfully
Background
As of writing this issue, July 9th 2019, we have 36 file servers (a.k.a. Gitaly nodes) that host our users' GitLab repositories. We have agreed not to let any node's disk utilization exceed 75%. If it does, we start rebalancing our nodes by moving repos around (and to new nodes, when necessary).
Problem
When executing https://gitlab.com/gitlab-com/gl-infra/production/issues/875, one of the things we noticed was that we don't have a robust way to validate and confirm that a repo was moved successfully after the storage_rebalance.rb script finishes running. The flow has been like this:
- SRE1 would follow the SOP
- Kick off a move of ??? GB from one server to another and wait.
- Once the script finishes (assuming it was successful), SRE1 would prepare a file listing the repos that were touched by the script (those that have been renamed to *+moved*.git) and ask SRE2 to review them before SRE1 deletes those repos from the source server.
- SRE2 would then review a few repos manually and either sign off on the deletion or report suspicious findings.
- If suspicious findings were noted, SRE1 would then check the questionable repos one by one and start digging in.
Types of validations that have been done so far and their findings:
- Check the repo in Rails console and make sure it is pointing to the target server | This has worked smoothly.
- Check the repo in Rails console and make sure it is NOT marked as read-only | This has worked smoothly.
- Compared repo contents at the highest level (maxdepth=1) and found a missing hooks symlink, a different ./git/config, and files missing from the ./git directory on the target repo | This was confirmed to be okay by the devs in https://gitlab.com/gitlab-com/gl-infra/production/issues/875#note_184885141.
- Running git fsck on the source and target nodes reports different objects | This was due to dangling commits/blobs.
- Source and target repo sizes are different | This was due to the target repo not carrying over the ./git/objects/ SHA directories.
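
The git fsck finding above suggests that a raw diff of the two reports is too noisy to use as a check, because dangling commits/blobs show up as differences even when the move was fine. Here is a minimal sketch of one way to compare the reports while ignoring dangling entries. It assumes the standard output of git fsck has already been captured as text on each node. The function names are hypothetical and not part of storage_rebalance.rb.

```python
def significant_fsck_lines(fsck_output):
    """Return the set of fsck report lines, excluding 'dangling ...' entries."""
    lines = set()
    for raw in fsck_output.splitlines():
        line = raw.strip()
        # git fsck lists unreferenced objects as "dangling <type> <sha>".
        # Those were the source of the benign differences noted above,
        # so they are skipped here.
        if not line or line.startswith("dangling "):
            continue
        lines.add(line)
    return lines


def compare_fsck(source_output, target_output):
    """Report fsck lines that appear on only one side, ignoring dangling objects."""
    src = significant_fsck_lines(source_output)
    tgt = significant_fsck_lines(target_output)
    return {
        "only_on_source": sorted(src - tgt),
        "only_on_target": sorted(tgt - src),
    }
```

If both lists come back empty, the two reports agree once dangling objects are set aside. Anything left over would still need an SRE to look at it.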
Goal
The goal of this issue is to define the set of checks needed to confirm that a repo was migrated successfully, so that every SRE performs the exact same validation. That way, if the SREs currently handling the rebalancing effort are not the ones doing it, other team members can still run the exact same checks. As an iteration, we can look at automating this in the future.
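
As a rough illustration of what a shared checklist could look like once it is written down, the sketch below encodes the two Rails console checks from the findings list as named checks that produce the same report for every reviewer. Everything here is an assumption for illustration: the dict keys (storage, expected_storage, read_only) are hypothetical stand-ins for values an SRE would look up in the Rails console, and the check list is not an agreed set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Check:
    name: str
    run: Callable[[Dict], Tuple[bool, str]]


def on_target_storage(repo):
    # Hypothetical fields: the storage the repo currently points at,
    # and the storage the move was supposed to land on.
    ok = repo["storage"] == repo["expected_storage"]
    return ok, f"storage is {repo['storage']}"


def not_read_only(repo):
    # Hypothetical field: the read-only flag as seen in the Rails console.
    return (not repo["read_only"]), f"read_only is {repo['read_only']}"


CHECKS: List[Check] = [
    Check("points at target server", on_target_storage),
    Check("not marked read-only", not_read_only),
]


def validate(repo, checks=CHECKS):
    """Run every check in order and return (name, passed, detail) tuples."""
    results = []
    for check in checks:
        ok, detail = check.run(repo)
        results.append((check.name, ok, detail))
    return results


def safe_to_delete_source(results):
    """Only sign off on deletion when every check passed."""
    return all(ok for _, ok, _ in results)
```

Keeping each check as a named entry in one list means further checks agreed in this issue could be added in one place, and the sign-off decision stays a simple "all checks passed".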