# Improving the avalanche characteristics of the FNV-1a algorithm for optimal string shuffling
13th Nov 2020
Sergio Losilla (sergio.losilla@neutrontherapeutics.com)
Software developer -- Neutron Therapeutics
## Background and motivation
[Catch2](https://github.com/catchorg/Catch2) is a widely used open-source test framework written in C++. One of the features it supports is randomizing the order in which test cases are run. The randomization is subset-invariant, meaning that, for a given seed, the relative order between test cases is preserved even if some tests are left out. For example, if for a given seed the order of tests is 2nd then 3rd then 1st, re-running without the 3rd test case will run 2nd then 1st. For this reason, the algorithm used to randomize the order cannot simply use random numbers. Instead, the algorithm sorts the test cases based on a hash generated from the seed and the name of the test case.
Well-designed software should not have state dependencies; we all know that global state is bad and should be avoided altogether if possible. Nevertheless, sometimes there is nothing we can do about it (see: external dependencies) and running test cases in random order, although not bulletproof, adds a bit of reassurance that we are handling things the right way.
The algorithm used in Catch2 to create the hash is based on a modified[^1] version of the [FNV-1a algorithm](https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function#FNV-1a_hash), which is a very fast, non-cryptographic hash, and is indeed a [popular choice](https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed). It is one of the usual suspects when one looks for fast hashing algorithms.
One shortcoming of FNV-1a is that the hashes of strings differing only in the last character are quite similar in their left-most bits. For example, the 64-bit hash for "Give me an a" is `e61148272046e5be`, while the hash for "Give me an e" is `e6114c272046ec8a`[^2]. Compare this to the hashes for "You are number 6" (`ea412090e7d07541`) and "You are number 2" (`ea412490e7d07c0d`). The effect in Catch2 is that test cases whose names differ only in the last character end up being executed immediately before or after each other. Therefore, some of the possible test case permutations are never observed, potentially hiding some deficiencies that would be exposed by the test suite.
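For reference, the core of FNV-1a is only a few lines. This is a sketch of the standard (unseeded) 64-bit version, not the Catch2 variant:

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Standard 64-bit FNV-1a: XOR in each input byte, then multiply by the FNV prime.
constexpr std::uint64_t fnv1a_64(std::string_view s) {
    std::uint64_t hash = 14695981039346656037ULL; // FNV offset basis
    for (unsigned char c : s) {
        hash ^= c;
        hash *= 1099511628211ULL; // FNV prime
    }
    return hash;
}
```

Since the last character is only mixed in by a single XOR-and-multiply step, strings sharing all but their final character end up with hashes that are close in their upper bits, which is exactly the similarity quoted above.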
So, the aim of this utterly overthought and over-engineered analysis (done by someone who has a very limited understanding of cryptography) is to find an algorithm whose hashes do not betray similarities between input strings, while maintaining the rest of the nice properties (speed, uniqueness, nerve, talent. Wait, not the last two) that FNV-1a offers.
## Methods
An ideal hashing algorithm would provide all possible test case permutations with equal probability. We will consider four test cases, which can be arranged in a total of 24 (4!) permutations. Our problem is equivalent to verifying that, for a box containing balls numbered from 1 to 24, each number is drawn with equal probability.
For each hashing algorithm, I calculated a histogram to visually inspect how it compares with our baseline: using a (pseudo-)random number instead of a hash to generate the sorting indices. I also did a quick-and-dirty timing benchmark to ensure that the winning algorithm does not incur serious overhead.
## Hashing algorithms
These are the algorithms that will be compared in our analysis:
| Algorithm identifier | Description |
|-----------------------|-----------------------------------------------------------------------------------------|
| `fnv1a_mod` | The hashing algorithm shipped with Catch2 (as of v2.13.3). |
| `fnv1a_mod_tail` | Same as `fnv1a_mod`, but `"tail"` is appended to each test case name. |
| `fnv1a_mod_xor_fold` | Same as `fnv1a_mod`, but do a XOR folding (see below) on the obtained hash. |
| `fnv1a_mod_mult_fold` | Same as `fnv1a_mod`, but do a multiplication folding (see below) on the obtained hash. |
| `fnv1a_new` | The original FNV-1a algorithm (using the same offset basis), but append the seed as the last character of the string (see below).|
| `fnv1a_new_mult_fold` | Same as `fnv1a_new`, but do a multiplication folding (see below) on the obtained hash. |
| `fnv1a_new_xor_fold`  | Same as `fnv1a_new`, but do a XOR folding (see below) on the obtained hash.             |
| `random` | Use a `std::default_random_engine` (using the seed for instantiating it) to generate the hashes.|
A few other things were tested (prefixing instead of appending, etc), but they are not worth discussing since they did not provide anything new compared to the other algorithms.
### Folding
The authors of the FNV algorithms suggest one recipe to obtain a "somewhat stronger hash"[^3], which consists of XORing the bits of the upper half of the hash with the bits of the lower half. This of course halves the size of the hash, so if 64-bit hashes are needed, the 128-bit hashing algorithm must be used. For our purpose, we are probably talking about hundreds of tests, which would snugly fit into our 32-bit hash space. But even for 100 000 test cases, the probability of a hash collision is 70%[^4]: this is probably fine, since a collision only gives a tiny edge for test cases listed first to appear before test cases listed later. From a speed point of view, 32-bit or 64-bit should not make a difference on 64-bit architectures.
Some preliminary tests were not very promising for the XOR folding, so an alternative folding using multiplication (with overflow, i.e. keeping only the low 64 bits of the result) instead of XOR was also tested.
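As a rough sketch, the two foldings could look like the following. The XOR fold is the FNV authors' recipe; the multiplication fold shown here (multiplying the two halves together) is one plausible reading of the variant described above, not necessarily the exact code used in the analysis:

```cpp
#include <cassert>
#include <cstdint>

// XOR folding: combine the upper and lower 32-bit halves with XOR,
// yielding a 32-bit hash (the recipe suggested by the FNV authors).
constexpr std::uint32_t xor_fold(std::uint64_t h) {
    return static_cast<std::uint32_t>(h >> 32) ^ static_cast<std::uint32_t>(h);
}

// Multiplication folding (assumed variant): multiply the two halves
// instead of XORing them, keeping the result modulo 2^64.
constexpr std::uint64_t mult_fold(std::uint64_t h) {
    return (h >> 32) * (h & 0xffffffffULL);
}
```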
### `fnv1a_new`: Back to the origins
I tried to go back to the original algorithm and append the seed at the end of the string, instead of using the seed as the offset basis. Spoiler alert: this naive approach failed miserably, but adding a multiplication fold after it is my horse in this race.
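A minimal sketch of the idea follows. The seed width (32 bits) and the byte order in which it is fed to the hash are my assumptions here, not necessarily what the analysis code does:

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// FNV-1a with the original offset basis; the seed goes through the same
// mixing step after the string, as if its bytes were appended to it.
constexpr std::uint64_t fnv1a_new(std::string_view s, std::uint32_t seed) {
    std::uint64_t hash = 14695981039346656037ULL; // original FNV offset basis
    auto mix = [&](std::uint8_t byte) {
        hash ^= byte;
        hash *= 1099511628211ULL; // FNV prime
    };
    for (unsigned char c : s) mix(c);
    // Append the seed, least significant byte first (assumed order).
    for (int i = 0; i < 4; ++i) mix(static_cast<std::uint8_t>(seed >> (8 * i)));
    return hash;
}
```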
## Results
As I have no idea how to mathematically prove the properties of the hashing algorithms, I resorted to a simple test that hopefully is enough for the current purposes. In summary, I select four strings (the potential test case names), shuffle them a million times (each time with a different random seed coming from `std::random_device`), and count how many times I got each permutation. Each permutation is labeled "0123", "2310", etc. As said above, there are 24 possible permutations. The normalized count is a good estimator of the probability of getting each permutation, which should be 1/24 were the hashes distributed homogeneously. Good algorithms should have histograms looking like a straight line.
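The counting step boils down to sorting the test case indices by their hashes and recording the resulting order as a label. A sketch of that step (the names here are mine, not the actual code in `main.cpp`):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Sort the test case indices by their hash values and encode the
// resulting order as a label such as "0123" or "2310".
std::string permutation_label(const std::vector<std::uint64_t>& hashes) {
    std::vector<std::size_t> order(hashes.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return hashes[a] < hashes[b]; });
    std::string label;
    for (std::size_t i : order) label += static_cast<char>('0' + i);
    return label;
}
```

A histogram is then just a map from label to count, accumulated over the million seeds.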
### Common usage: disparate test case names
The most common occurrence is that tests will have quite varied test case names. For this, I selected:
`"Some name"`, `"Another test"`, `"Feature"` and `"Foobar"`
![](Histograms_different.jpg)
The graph shows quite clearly that for most intents and purposes, this article is a trivial (although admittedly fun) endeavor. All algorithms, including `fnv1a_mod`, are doing Just Fine™. Except for `fnv1a_new`: the final scrambling with the seed does very little to differentiate the hashes, and most of the permutations are never observed. Adding the XOR folding (`fnv1a_new_xor_fold`) helps a lot, but there are still some noticeable biases compared to the rest.
### The suite contains two test cases which differ only in the last character
Problems start to show when we have test cases that differ only in the last character. For example, this is the histogram corresponding to shuffling the strings `"a1"`, `"a2"`, `"Test test"` and `"Unit tests"`:

![](Histograms_two_similar.jpg)

For `fnv1a_mod`, `fnv1a_mod_xor_fold`, `fnv1a_new` and `fnv1a_new_xor_fold`, only 12 out of the 24 permutations are observed at all. These are the permutations where 0 and 1 are adjacent to each other. If running 2 or 3 in between them would cause any issues, this would not be revealed by our test suite. `fnv1a_mod_tail` shows some degradation, but it seems to be fine.

In case you are wondering whether longer test names make a difference, the answer is no: the only thing that matters is that all characters but the last are the same.
### Two pairs
What happens when we have two pairs of similar-looking test names, such as `"a1"`, `"a2"`, `"b1"` and `"b2"`? (Yes, sorry, I was running out of ideas).
![](Histograms_two_similar_pairs.jpg)
Well, exactly what we expect: The algorithms that failed for a single pair show only 8 possible combinations, when 0 is adjacent to 1 and 2 is adjacent to 3. `fnv1a_mod_tail` is starting to show some serious bias, but the pattern is more difficult to recognize. Both algorithms with the multiplication fold are still providing a nice, homogeneous distribution.
### Four similar test cases
This is the histogram for shuffling `"a1"`, `"a2"`, `"a3"` and `"a4"`:
![](Histograms_four_similar.jpg)
In the unlucky event that we have labeled our tests `"Test 1"`, `"Test 2"`, `"Test A"`, etc, most of the tested algorithms do not work very well, and some permutations will be observed very seldom.
`fnv1a_mod_mult_fold` and in particular `fnv1a_mod_tail` perform quite acceptably, although there are visible biases. Quite surprisingly (since I found it by sheer dumb luck), `fnv1a_new_mult_fold` seems indistinguishable from `random`, even when we zoom in:
![](Histograms_four_similar_zoom.jpg)
(Yes, I know that this will not score high on [r/dataisbeautiful](https://reddit.com/r/dataisbeautiful) but I hope that it's enough to get the message across)
Why does `fnv1a_new_mult_fold` perform remarkably better than `fnv1a_mod_mult_fold`? I can only guess, but it must be that the additional processing after the last character of the string has a sufficiently large effect, which when compounded with the multiplication spreads the hashes far enough.
### Speed
Timing code is one of those things that has me scratching my head again and again, and I am sure I am doing something wrong. But for our purpose of making sure we are not doing something embarrassingly stupid, this should suffice.
Here are the average timings (in seconds) for hashing an empty string, a string consisting of the character 'a' repeated 40 times, and the character 'a' repeated 80 times (compiled with GCC 10.2.1 with `-O3` and run on my Threadripper 1950X):
| | Empty| 40 char| 80 char|
|----------------------|--------------:|--------------:|--------------:|
|`fnv1a_mod` | 2.27e-08| 5.64e-08| 9.46e-08|
|`fnv1a_mod_mult_fold` | 2.26e-08| 5.60e-08| 9.53e-08|
|`fnv1a_mod_tail` | 2.30e-08| 5.95e-08| 9.79e-08|
|`fnv1a_mod_xor_fold` | 2.28e-08| 5.60e-08| 9.61e-08|
|`fnv1a_new` | 2.28e-08| 5.75e-08| 9.66e-08|
|`fnv1a_new_mult_fold` | 2.32e-08| 5.83e-08| 9.73e-08|
|`fnv1a_new_xor_fold` | 2.29e-08| 5.77e-08| 9.63e-08|
|`random` | 2.28e-08| 2.28e-08| 2.27e-08|
I don't really know what the error bars are like, but this tells me that all choices are reasonable enough.
## Some notes on the testing code
The whole testing code is implemented in `main.cpp` using only standard library headers. I wanted to use C++ to start with exactly [the same code used by Catch2](https://github.com/catchorg/Catch2/blob/0fa133a0c5e065065ef96ac2b6c0284cf5da265d/src/catch2/internal/catch_test_case_registry_impl.cpp#L24).
I took the chance to have some fun with C++, treating it as a declarative scripting language: there are a bunch of functions that take and return functions. I have not kept performance in mind at all, although the hashing functions themselves should be efficient. The source contains a quick-and-dirty JSON printer (the collection of overloaded `<<` operators) which paid off when debugging. I think there might be a couple of interesting things for someone starting to delve into modern C++, although I'm not sure how clear the code is for that purpose.
You will need a C++20 capable compiler. The reason is that I started playing around with [ranges](https://en.cppreference.com/w/cpp/ranges), although I mostly gave up. The only thing that survived was `std::ranges::sort()`, which provides a neat way to sort based on [projections](https://ezoeryou.github.io/blog/article/2019-01-22-ranges-projection.html). This greatly simplifies writing code when a `<` operator is readily available for the data member we want to use as sorting index (which reminds me of Haskell's [`Data.List.sortOn`](https://hackage.haskell.org/package/base-4.14.0.0/docs/Data-List.html#g:21)). The other reason is that I forgot plenty of `typename`s, which apparently [are not required anymore in C++20](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0634r3.html). Good riddance, I say!
## Conclusions
As I am not an expert in cryptography, I can only say that `fnv1a_new_mult_fold` looks good to me. At the very least for subset-invariant shuffling, it does not seem to do worse than the algorithm currently shipped with Catch2. I would not be surprised at all if I made some serious oversight.
## Footnotes
[^1]: The modification consists of using a seed instead of the 64-bit offset basis (14695981039346656037) given by the authors.
[^2]: You can find a FNV-1a calculator [here](https://md5calc.com/hash/fnv1a64).
[^3]: [Other Hash Sizes and XOR Folding](https://tools.ietf.org/html/draft-eastlake-fnv-17#section-3).
[^4]: The probability of having collisions for 100 000 hashes using a 32-bit hash is 0.688 ([you can calculate collision probabilities here](http://davidjohnstone.net/pages/hash-collision-probability)).