On Collision Resistance and Content Addressable Storage

A new section, "On content addressable storage," was added to obnam.md. I think it contains a few mistakes.

A cryptographically secure hash function has three properties:

Pre-image resistance (find m given Hash(m)),
Second pre-image resistance (find an m2 given m1 and Hash(m1) s.t. Hash(m1) = Hash(m2)), and
Collision resistance (find an m1 and m2 s.t. Hash(m1) = Hash(m2)).

In Obnam, we don't really care about pre-image resistance: the encryption protects a chunk's confidentiality.

SHA-1's collision resistance has been broken. But, its second pre-image resistance is still intact. In fact, MD5's second pre-image resistance is still fine! This is because second pre-image resistance is about finding a second message with the same hash as a fixed message, but collision resistance is about finding two messages with the same hash. The second problem is much less constrained. In particular, collision resistance is susceptible to a birthday paradox (the attacker can choose both messages when creating a collision, but only one message when creating a second pre-image).

So, let's assume that we use SHAX and its collision resistance is compromised. What can an attacker do? I'd say: not so much. The only entity that can write to the object store is the client (presumably, the object store is secured via a merkel tree-like construct and the root is stored on the client), which is a bit different from git, which is a shared repository. Let's assume that an attacker convinces the user to save two files that collide and those are backed up by obnam. This attack would have to be crafted specifically for obnam, because obnam chunks files is an obnam specific way (chunking, meta-data). But, what harm does it do? Probably not much. This attacker controlled data is probably not so relevant to the user. (This isn't a second pre-image attack!)

That doesn't mean that we can't protect ourselves against this attack. The easiest thing to do is to simply use two hash functions and concatenate the results:

h = SHA2(chunk)||SHA3(chunk)

Now the attacker has to collide two hash functions. That's hard^2. It turns out that combining SHA2 and SHA3 in this way is actually pretty secure: SHA2 and SHA3 use completely different constructs. So, an attack on SHA2 is unlikely to apply to SHA3 and vice versa.

SHA2 actually uses a construct similar to SHA-1. But, AFAIK, cryptographers are not worried about SHA2 being successfully attacked in the near future.

Concatenating would also solve your "extreme and rare, but real, case consider a researcher of checksum algorithms scenario". And, if you really wanted, you could even add another protection: hash a known, fixed number with the chunk. That is:

h = Hash(chunk||0x42)

This means that even if Hash(chunk1) == Hash(chunk2), Hash(chunk1||0x42) != Hash(chunk2||0x42).

So, I'd strongly encourage you to reconsider this: To achieve this, Obnam does not use content-addressable storage.

In the end, you need a merkel tree to ensure integrity and that means you have to hash. So by not using content addressable storage, you haven't avoided the problem you think you are avoiding. |