Add hash command to repocutter (!630) · Merge requests · Eric S. Raymond / reposurgeon

Lionel Debroux requested to merge debrouxl/reposurgeon:repocutter-hash-command into master May 16, 2023

Hello,

For an ongoing SVN -> Git conversion of a repository which contains gigabytes of binary data, I wanted a way to greatly reduce the size of the new production Git repository, while keeping a way to rebuild old content - if ever needed - from a content-addressable data archive.

A reasonable way to achieve that is to replace file contents with a cryptographic hash thereof. Therefore, that's what I set out to implement, with SHA-1, SHA-256 and Git blob SHA-1 hash options. For usage against the archive Git repository I have built (flat git-svn import, i.e. no -s / -b / -t / -T arguments), the Git blob SHA-1 hash is arguably the most useful one. However, other forms of content-addressable archives may take raw standard hashes. I implemented those first anyway, as an initial validation of the concept, before looking up how a Git blob hash is constructed.
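For readers unfamiliar with the difference between the three options, here is a minimal Go sketch of how each digest can be computed. It is not the code from this merge request: the function names (rawSHA1, rawSHA256, gitBlobSHA1) are made up for illustration, and the whole content is assumed to fit in memory. The one fact it relies on is that Git hashes a blob as the SHA-1 of the header "blob <size>" followed by a NUL byte and then the raw content.

```go
package main

import (
	"crypto/sha1"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// rawSHA1 returns the hex SHA-1 digest of the content alone,
// i.e. what sha1sum prints for the same bytes.
// (Hypothetical helper name, not taken from repocutter.)
func rawSHA1(content []byte) string {
	sum := sha1.Sum(content)
	return hex.EncodeToString(sum[:])
}

// rawSHA256 returns the hex SHA-256 digest of the content alone.
// (Hypothetical helper name, not taken from repocutter.)
func rawSHA256(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// gitBlobSHA1 returns the object ID Git assigns to a blob with this
// content: SHA-1 over "blob <size>\x00" followed by the content.
// (Hypothetical helper name, not taken from repocutter.)
func gitBlobSHA1(content []byte) string {
	h := sha1.New()
	fmt.Fprintf(h, "blob %d\x00", len(content)) // Git object header
	h.Write(content)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	content := []byte("hello\n")
	fmt.Println("sha1:    ", rawSHA1(content))
	fmt.Println("sha256:  ", rawSHA256(content))
	fmt.Println("git blob:", gitBlobSHA1(content))
}
```

Because of the header, the Git blob hash of a file never equals its raw SHA-1. For example, empty content has the raw SHA-1 da39a3ee5e6b4b0d3255bfef95601890afd80709 but the Git blob ID e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.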

Besides building up the regression tests, I have spot-checked my repositories: several Git blob SHA-1 hashes match the ones found in the archive repository, and the SHA-1 hashes produced by repocutter -h sha1 hash ... match the output of git show <Git blob SHA-1 hash> | sha1sum -. This works for both small and large files, one of the checked blobs being larger than 600 MB.

