# Improving the avalanche characteristics of the FNV-1a algorithm for optimal string shuffling

13th Nov 2020

Sergio Losilla (sergio.losilla@neutrontherapeutics.com)

Software developer -- Neutron Therapeutics

## Background and motivation

[Catch2](https://github.com/catchorg/Catch2) is a widely used open-source test framework written in C++. One of the features it supports is randomizing the order in which test cases are run. The randomization is subset-invariant, meaning that, for a given seed, the relative order between test cases is preserved even if some tests are left out. For example, if for a given seed the order of tests is 2nd then 3rd then 1st, re-running without the 3rd test case will run 2nd then 1st. For this reason, the algorithm used to randomize the order cannot simply use random numbers. Instead, the algorithm sorts the test cases based on a hash generated from the seed and the name of the test case.
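
In code, the idea looks roughly like this (a minimal sketch with a placeholder key; names like `sortKey` and `shuffleTests` are made up for illustration, and Catch2's real implementation differs):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Placeholder key: any hash of (seed, name) works for the illustration.
// Catch2 derives this key with a seeded FNV-1a variant instead.
std::size_t sortKey(std::size_t seed, const std::string& name) {
    return std::hash<std::string>{}(name + '#' + std::to_string(seed));
}

// Sorting by the key makes the order subset-invariant: the relative order of
// any two tests depends only on their own keys, never on which other tests
// happen to be present.
void shuffleTests(std::vector<std::string>& names, std::size_t seed) {
    std::sort(names.begin(), names.end(),
              [seed](const std::string& a, const std::string& b) {
                  return sortKey(seed, a) < sortKey(seed, b);
              });
}
```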

Well-designed software should not have state dependencies; we all know that global state is bad and should be avoided altogether if possible. Nevertheless, sometimes there is nothing we can do about it (see: external dependencies), and running test cases in random order, although not bulletproof, adds a bit of reassurance that we are handling things the right way.

The algorithm used in Catch2 to create the hash is based on a modified[^1] version of the [FNV-1a algorithm](https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash), which is a very fast, non-cryptographic hash, and is indeed a [popular choice](https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed). It is one of the usual suspects when one looks for fast hashing algorithms.
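
For reference, a textbook 64-bit FNV-1a is just a XOR-then-multiply loop over the input bytes; Catch2's modified version replaces the offset basis below with the user-supplied seed (see footnote 1). A minimal sketch:

```cpp
#include <cstdint>
#include <string_view>

std::uint64_t fnv1a64(std::string_view data) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char byte : data) {
        hash ^= byte;              // fold in the next byte...
        hash *= 1099511628211ULL;  // ...then multiply by the 64-bit FNV prime
    }
    return hash;
}
```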

One shortcoming of FNV-1a is that the hashes of strings differing only in the last character are quite similar in their left-most bits. For example, the 64-bit hash for "Give me an a" is `e61148272046e5be`, while the hash for "Give me an e" is `e6114c272046ec8a`[^2]. Compare this to the hashes for "You are number 6" (`ea412090e7d07541`) and "You are number 2" (`ea412490e7d07c0d`). The effect in Catch2 is that test cases whose names differ only in the last character end up being executed immediately before or after each other. Therefore, some of the possible test case permutations are never observed, potentially hiding some deficiencies that would be exposed by the test suite.

So, the aim of this utterly overthought and over-engineered analysis (done by someone who has a very limited understanding of cryptography) is to find an algorithm whose hashes look unrelated even when the input strings are similar, while maintaining the rest of the nice properties (speed, uniqueness, nerve, talent. Wait, not the last two) that FNV-1a offers.

## Methods

An ideal hashing algorithm would provide all possible test case permutations with equal probability. We will consider four test cases, which can be arranged in a total of 24 (4!) permutations. Our problem is equivalent to verifying that, for a box containing balls numbered from 1 to 24, each number is drawn with equal probability.

For each hashing algorithm, I calculated a histogram to visually inspect how it compares with our baseline: using a (pseudo-)random number instead of a hash to generate the sorting indices. I also did a quick-and-dirty timing benchmark, to ensure that the winning algorithm does not incur serious overhead.

## Hashing algorithms

These are the algorithms that will be compared in our analysis:

| Algorithm identifier  | Description                                                                             |
|-----------------------|-----------------------------------------------------------------------------------------|
| `fnv1a_mod`           | The hashing algorithm shipped with Catch2 (as of v2.13.3).                              |
| `fnv1a_mod_tail`      | Same as `fnv1a_mod`, but `"tail"` is appended to each test case name.                   |
| `fnv1a_mod_xor_fold`  | Same as `fnv1a_mod`, but do a XOR folding (see below) on the obtained hash.             |
| `fnv1a_mod_mult_fold` | Same as `fnv1a_mod`, but do a multiplication folding (see below) on the obtained hash.  |
| `fnv1a_new`           | The original FNV-1a algorithm (using the original offset basis), but with the seed appended at the end of the string (see below).|
| `fnv1a_new_mult_fold` | Same as `fnv1a_new`, but do a multiplication folding (see below) on the obtained hash.  |
| `fnv1a_new_xor_fold`  | Same as `fnv1a_new`, but do a XOR folding (see below) on the obtained hash.             |
| `random`              | Use a `std::default_random_engine` (using the seed for instantiating it) to generate the hashes.|

A few other things were tested (prefixing instead of appending, etc.), but they are not worth discussing, since they did not provide anything new compared to the other algorithms.

### Folding

The authors of the FNV algorithms suggest one recipe to obtain a "somewhat stronger hash"[^3], which consists of XORing the bits of the upper half of the hash with the bits of the lower half. This of course halves the size of the hash, so if 64-bit hashes are needed, the 128-bit hashing algorithm must be used. For our purpose, we are probably talking about hundreds of tests, which would snugly fit into our 32-bit hash space. Even for 100 000 test cases, where the probability of a hash collision is about 70%[^4], this is probably fine: a collision only gives the test case listed first a tiny edge to run before the one listed later. From a speed point of view, 32-bit or 64-bit should not make a difference on 64-bit architectures.
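
For the curious, the ~70% figure is just the usual birthday-problem approximation with n = 100 000 names in an m = 2^32 hash space:

```math
P(\text{collision}) \approx 1 - e^{-n^2/(2m)} = 1 - e^{-10^{10}/2^{33}} \approx 1 - e^{-1.164} \approx 0.688
```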

Some preliminary tests were not very promising for the XOR folding, so an alternative folding using multiplication (with overflow, i.e. keeping only the last 64 bits of the result) instead of XOR was also tested.
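
As a sketch of how I read the two folding variants (the exact code in `main.cpp` may differ in the details): split the 64-bit hash into its two 32-bit halves and combine them either with a XOR or with a multiplication.

```cpp
#include <cstdint>

// XOR folding as suggested by the FNV authors: halves the hash width.
std::uint32_t xorFold(std::uint64_t hash) {
    return static_cast<std::uint32_t>(hash >> 32) ^
           static_cast<std::uint32_t>(hash);
}

// Multiplication folding: the product of the two 32-bit halves fits in
// 64 bits; with wider hashes only the low 64 bits of the product would
// survive (unsigned overflow simply wraps around).
std::uint64_t multFold(std::uint64_t hash) {
    return (hash >> 32) * (hash & 0xFFFFFFFFULL);
}
```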

### `fnv1a_new`: Back to the origins

I tried to go back to the original algorithm and append the seed at the end of the string, instead of using the seed as the offset basis. Spoiler alert: this naive approach failed miserably, but with a multiplication fold added after it, it is my horse in this race.
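
A sketch of the `fnv1a_new` idea under those assumptions (the exact way the seed bytes are fed in is a `main.cpp` detail): keep the original offset basis and run the seed's bytes through the same loop after the string, as if they were appended to it. `fnv1a_new_mult_fold` then applies the multiplication fold from the previous section to the result.

```cpp
#include <cstdint>
#include <string_view>

std::uint64_t fnv1aNew(std::string_view name, std::uint64_t seed) {
    std::uint64_t hash = 14695981039346656037ULL;  // original FNV offset basis
    auto mix = [&hash](unsigned char byte) {
        hash ^= byte;
        hash *= 1099511628211ULL;                  // 64-bit FNV prime
    };
    for (unsigned char c : name) mix(c);
    // "Append" the seed: hash its bytes after the string.
    for (int shift = 0; shift < 64; shift += 8)
        mix(static_cast<unsigned char>(seed >> shift));
    return hash;
}
```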

## Results

As I have no idea how to mathematically prove the properties of the hashing algorithms, I resorted to a simple test that hopefully is enough for the current purposes. In summary, I select four strings (the potential test case names), shuffle them a million times (each time with a different random seed coming from `std::random_device`), and count how many times each permutation occurs. Each permutation is labeled "0123", "2310", etc. As noted above, there are 24 possible permutations. The normalized count is a good estimator of the probability of getting each permutation, which should be 1/24 were the hashes distributed homogeneously. Good algorithms should have histograms that look like a flat, horizontal line.
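
The experiment boils down to something like the following (a simplified sketch; the helper names are mine and the real `main.cpp` is structured differently):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// Sort the indices 0..3 by the hash of (name, seed) and return e.g. "2310".
template <typename Hash>  // Hash: (name, seed) -> sort key
std::string permutationLabel(const std::vector<std::string>& names,
                             std::uint64_t seed, Hash hash) {
    std::vector<std::size_t> order(names.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return hash(names[a], seed) < hash(names[b], seed);
    });
    std::string label;
    for (std::size_t i : order) label += static_cast<char>('0' + i);
    return label;
}

template <typename Hash>
std::map<std::string, int> countPermutations(const std::vector<std::string>& names,
                                             int trials, Hash hash) {
    std::random_device seeds;  // a fresh random seed for every shuffle
    std::map<std::string, int> counts;
    for (int t = 0; t < trials; ++t)
        ++counts[permutationLabel(names, seeds(), hash)];
    return counts;
}
```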

### Common usage: disparate test case names

The most common situation is that test cases have quite varied names. For this, I selected `"Some name"`, `"Another test"`, `"Feature"` and `"Foobar"`.

![](Histograms_different.jpg)

The graph shows quite clearly that, for most intents and purposes, this article is a trivial (although admittedly fun) endeavor. All algorithms, including `fnv1a_mod`, are doing Just Fine™. The exception is `fnv1a_new`: the final scrambling with the seed does very little to differentiate the hashes, and most of the permutations are never observed. Adding the XOR folding (`fnv1a_new_xor_fold`) helps a lot, but there are still some noticeable biases compared to the rest.

### The suite contains two test cases which differ only in the last character

Problems start to show when we have test cases that differ only in the last character. For example, this is the histogram corresponding to shuffling the strings `"a1"`, `"a2"`, `"Test test"` and `"Unit tests"`: for `fnv1a_mod`, `fnv1a_mod_xor_fold`, `fnv1a_new` and `fnv1a_new_xor_fold`, only 12 out of the 24 permutations are observed at all. These are the permutations where 0 and 1 are adjacent to each other. If running 2 or 3 between them would cause any issues, our test suite would never reveal it. `fnv1a_mod_tail` shows some degradation, but it seems to be fine.

In case you are wondering whether longer test names make a difference, the answer is no: the only thing that matters is that all characters but the last are the same.

![](Histograms_two_similar.jpg)

### Two pairs

What happens when we have two pairs of similar-looking test names, such as `"a1"`, `"a2"`, `"b1"` and `"b2"`? (Yes, sorry, I was running out of ideas).

![](Histograms_two_similar_pairs.jpg)

Well, exactly what we expect: the algorithms that failed for a single pair show only 8 possible permutations, those where 0 is adjacent to 1 and 2 is adjacent to 3. `fnv1a_mod_tail` is starting to show some serious bias, although the pattern is more difficult to recognize. Both algorithms with the multiplication fold are still providing a nice, homogeneous distribution.

### Four similar test cases

This is the histogram for shuffling `"a1"`, `"a2"`, `"a3"` and `"a4"`:

![](Histograms_four_similar.jpg)

In the unlucky event that we have labeled our tests `"Test 1"`, `"Test 2"`, `"Test A"`, etc., most of the tested algorithms do not work very well, and some permutations are observed only very seldom.

`fnv1a_mod_mult_fold` and in particular `fnv1a_mod_tail` perform quite acceptably, although there are visible biases. Quite surprisingly (since I found it by sheer dumb luck), `fnv1a_new_mult_fold` seems indistinguishable from `random`, even when we zoom in:

![](Histograms_four_similar_zoom.jpg)

(Yes, I know that this will not score high on [r/dataisbeautiful](https://reddit.com/r/dataisbeautiful) but I hope that it's enough to get the message across)

Why does `fnv1a_new_mult_fold` perform remarkably better than `fnv1a_mod_mult_fold`? I can only guess, but it must be that the additional processing after the last character of the string has a sufficiently large effect, which, when compounded by the multiplication, spreads the hashes far enough apart.

### Speed

Timing code is one of those things that have me scratching my head again and again, and I am sure I am doing something wrong. But for our purpose of making sure we are not doing anything embarrassingly stupid, this should suffice.
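
The benchmark is essentially a repeat-and-average loop along these lines (a sketch under my assumptions; the real code may keep the optimizer honest differently):

```cpp
#include <chrono>
#include <cstdint>
#include <string>

// Average wall-clock time per call, in seconds.
template <typename Hash>
double averageSeconds(Hash hash, const std::string& input, int repetitions) {
    volatile std::uint64_t sink = 0;  // keep the calls from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repetitions; ++i)
        sink = hash(input);
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(stop - start).count() / repetitions;
}
```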

Here are the average timings (in seconds) for hashing an empty string, a string consisting of the character 'a' repeated 40 times, and the character 'a' repeated 80 times (compiled with GCC 10.2.1 with `-O3` and run on my Threadripper 1950X):

|                      |          Empty|        40 char|        80 char|
|----------------------|--------------:|--------------:|--------------:|
|`fnv1a_mod`           |       2.27e-08|       5.64e-08|       9.46e-08|
|`fnv1a_mod_mult_fold` |       2.26e-08|       5.60e-08|       9.53e-08|
|`fnv1a_mod_tail`      |       2.30e-08|       5.95e-08|       9.79e-08|
|`fnv1a_mod_xor_fold`  |       2.28e-08|       5.60e-08|       9.61e-08|
|`fnv1a_new`           |       2.28e-08|       5.75e-08|       9.66e-08|
|`fnv1a_new_mult_fold` |       2.32e-08|       5.83e-08|       9.73e-08|
|`fnv1a_new_xor_fold`  |       2.29e-08|       5.77e-08|       9.63e-08|
|`random`              |       2.28e-08|       2.28e-08|       2.27e-08|

I don't really know what the error bars are like, but this tells me that all choices are reasonable enough.

## Some notes on the testing code

The whole testing code is implemented in `main.cpp` using only standard library headers. I wanted to use C++ so that I could start from exactly [the same code used by Catch2](https://github.com/catchorg/Catch2/blob/0fa133a0c5e065065ef96ac2b6c0284cf5da265d/src/catch2/internal/catch_test_case_registry_impl.cpp#L24).

I took the chance to have some fun with C++, treating it as a declarative scripting language. There are a bunch of functions that take and return functions. I have not kept performance in mind at all, although the hashing functions should be efficient. The source contains a quick-and-dirty JSON printer (the collection of overloaded `<<` operators) which paid off while debugging. I think there might be a couple of interesting things for someone starting to delve into modern C++, but I'm not sure how clear the code is for that.

You will need a C++20-capable compiler. One reason is that I started playing around with [ranges](https://en.cppreference.com/w/cpp/ranges), although I mostly gave up. The only thing that survived was `std::ranges::sort()`, which provides a neat way to sort based on [projections](https://ezoeryou.github.io/blog/article/2019-01-22-ranges-projection.html). This greatly simplifies the code when a `<` operator is readily available for the data member we want to use as the sorting index (which reminds me of Haskell's [`Data.List.sortOn`](https://hackage.haskell.org/package/base-4.14.0.0/docs/Data-List.html#g:21)). The other reason is that I forgot plenty of `typename`s, which apparently [are not required anymore in C++20](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0634r3.html). Good riddance, I say!
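
As a tiny illustration (with a hypothetical struct, not taken from `main.cpp`), sorting by a data member via a projection needs no hand-written comparator:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct HashedName {
    std::string name;
    std::uint64_t hash;
};

void sortByHash(std::vector<HashedName>& entries) {
    // The third argument is the projection: compare the projected member
    // with the default comparator, much like Haskell's sortOn.
    std::ranges::sort(entries, {}, &HashedName::hash);
}
```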

## Conclusions

As I am not an expert in cryptography, I can only say that `fnv1a_new_mult_fold` looks good to me. At the very least for subset-invariant shuffling, it does not seem to do worse than the algorithm currently shipped with Catch2. I would not be surprised at all if I made some serious oversight.

## Footnotes

[^1]: The modification consists of using a seed instead of the 64-bit offset basis (14695981039346656037) given by the authors.
[^2]: You can find a FNV-1a calculator [here](https://md5calc.com/hash/fnv1a64).
[^3]: [Other Hash Sizes and XOR Folding](https://tools.ietf.org/html/draft-eastlake-fnv-17#section-3).
[^4]: The probability of having collisions among 100 000 items using a 32-bit hash is 0.688 ([you can calculate collision probabilities here](http://davidjohnstone.net/pages/hash-collision-probability)).