Add hash command to repocutter

Hello,

For an ongoing SVN -> Git conversion of a repository which contains gigabytes of binary data, I wanted a way to greatly reduce the size of the new production Git repository, while keeping a way to rebuild old content - if ever needed - from a content-addressable data archive.

A reasonable way to achieve that is to replace file contents with a cryptographic hash thereof. That's what I set out to implement, with SHA-1, SHA-256 and Git blob SHA-1 hash options. For use against the archive Git repository I have built (flat git-svn import, i.e. no -s / -b / -t / -T arguments), the Git blob SHA-1 hash is arguably the most useful one. However, there may be other forms of content-addressable archives which take raw standard hashes. I implemented those first anyway, as an initial validation of the concept, before looking up how a Git blob hash is constructed.
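For reviewers unfamiliar with the difference between a raw hash and a Git blob hash, here is a minimal Python sketch. It is not the code in this merge request, and the function names `raw_hash` and `git_blob_hash` are hypothetical. It relies on the fact that Git computes a blob ID as the SHA-1 of the header `blob <size>` followed by a NUL byte and then the file content, and it reads in chunks so that large files need not fit in memory.

```python
import hashlib
import io


def raw_hash(stream, algorithm="sha1", chunk_size=1 << 20):
    """Hex digest of the raw bytes in `stream` (e.g. "sha1" or "sha256")."""
    h = hashlib.new(algorithm)
    # Read fixed-size chunks until read() returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def git_blob_hash(stream, size, chunk_size=1 << 20):
    """Hex SHA-1 that Git assigns to a blob holding `size` bytes of content.

    The size must be known up front because it is part of the header
    that is hashed before the content itself.
    """
    h = hashlib.sha1()
    h.update(b"blob %d\x00" % size)  # header: "blob <size>" + NUL byte
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


if __name__ == "__main__":
    payload = b"example file content\n"  # illustrative data only
    print("sha1:    ", raw_hash(io.BytesIO(payload), "sha1"))
    print("sha256:  ", raw_hash(io.BytesIO(payload), "sha256"))
    print("git blob:", git_blob_hash(io.BytesIO(payload), len(payload)))
```

The Git blob value should agree with what `git hash-object` reports for the same bytes, while the raw SHA-1 differs from it because no header is prepended.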

Besides building up the regression tests, I have spot-checked the results on my repositories: several Git blob SHA-1 hashes match the ones found in the archive repository, and the SHA-1 hashes produced by repocutter -h sha1 hash ... match the output of git show <Git blob SHA-1 hash> | sha1sum -. This works for both small and large files, one of the checked blobs being larger than 600 MB.
