Idea: Compressed Counters

Bogofilter counts words, and multiplies those counts after applying some corrective factors. In those multiplications, the order of magnitude of a count has a far greater impact than its exact value; meaning, the difference between 1 and 2 is much greater than the difference between 30 and 31. What this suggests is that the log of the values is what matters most (in floating-point terms, the exponent more than the mantissa). If we store logarithms, multiplication becomes addition and the n-th root becomes division by n. This would make it highly efficient to classify a large number of messages.
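To make the log-domain idea concrete, here is a minimal sketch in Python. It is not Bogofilter's actual computation; the function names and the sample values are made up for illustration. It only shows that a product followed by an n-th root gives the same result as a sum of logarithms followed by a division by n.

```python
import math

def geometric_mean_direct(probs):
    # Multiply all values, then take the n-th root.
    product = 1.0
    for p in probs:
        product *= p
    return product ** (1.0 / len(probs))

def geometric_mean_log(log_probs):
    # Same quantity from stored logarithms:
    # the product becomes a sum, the n-th root a division by n.
    return math.exp(sum(log_probs) / len(log_probs))

# Hypothetical per-word values, all in (0, 1].
probs = [0.9, 0.2, 0.6, 0.75]
log_probs = [math.log(p) for p in probs]  # would be stored once, at learning time

direct = geometric_mean_direct(probs)
via_logs = geometric_mean_log(log_probs)
```

The final `math.exp` is only needed if the result must be reported as a probability; comparing against a threshold could be done on the logarithm itself.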

Also note that the range of the probabilities is fixed, for instance to the [0;1] range. This means that a fixed-point representation is possible: we could use integers. (Are we now losing the benefits of vector optimisation and graphics processors? Not sure. We're surely simplifying, and that is always good.)
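One way the integer representation could look is sketched below. The choices here are assumptions, not part of the idea as stated: base-2 logarithms, 8 fractional bits (SCALE = 256), and an unsigned 16-bit code that stores the negated logarithm so that probability 1 maps to 0. Probability 0 has no finite logarithm, so it is clamped to the largest code.

```python
import math

SCALE = 256        # assumed: 8 fractional bits
MAX_CODE = 0xFFFF  # assumed: codes fit in an unsigned 16-bit integer

def encode(p):
    # Map a probability in [0, 1] to an integer code for -log2(p).
    if p <= 0.0:
        return MAX_CODE  # log of zero is undefined, so clamp
    code = round(-math.log2(p) * SCALE)
    return min(max(code, 0), MAX_CODE)

def decode(code):
    # Map an integer code back to an approximate probability.
    return 2.0 ** (-code / SCALE)

def combine(codes):
    # Integer mean of the codes, which is the n-th root of the
    # product of the probabilities (rounded down to a whole code).
    return sum(codes) // len(codes)
```

With these assumptions, `encode(0.5)` is 256, and combining the codes for 1.0 and 0.25 gives the code for 0.5, their geometric mean. The classification step then uses nothing but integer addition and one integer division.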

The counting of words made it easy to add and remove entries. This becomes a more costly process when the stored logarithms first need to be raised to a power, then added, and then reduced to a logarithm again. Without stored logarithms, however, that conversion work has to be done for every message's computation. Messages are more abundant than learning operations (unless every delivery is a learning opportunity).
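The learning step described above could be sketched as follows. This assumes the stored value is the natural logarithm of the count, with a count of zero stored as negative infinity; the name `learn` is hypothetical.

```python
import math

def learn(log_count, delta):
    # Update a stored log-count by delta occurrences (delta may be negative).
    # Raise to a power to recover the count, then add.
    count = math.exp(log_count) + delta
    # Guard against going below zero when unlearning.
    count = max(count, 0.0)
    # Reduce to a logarithm again; a zero count is stored as -inf.
    return math.log(count) if count > 0.0 else -math.inf

stored = math.log(3)         # word seen 3 times so far
stored = learn(stored, 2)    # learn a message containing it twice
```

Each update costs one exponential and one logarithm, but it is paid once per learned word rather than once per classified message. Round-tripping through floating point means the recovered count is only approximately an integer.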

When counting words in a message in a linear fashion, the repeated use of one word would greatly emphasise its influence in the filter. It may be more appropriate to reduce the importance of repeated words, and use a logarithm of the number of occurrences of a word.
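A small sketch of such damping follows. The exact function is an assumption: log(1 + n) is used here so that a word occurring once still gets a non-zero weight, but other choices such as 1 + log(n) would serve the same purpose.

```python
import math
from collections import Counter

def damped_weights(words):
    # Weight each distinct word by log(1 + occurrences) instead of
    # the raw number of occurrences.
    counts = Counter(words)
    return {word: math.log1p(n) for word, n in counts.items()}

# Hypothetical message: "buy" occurs four times, "now" once.
weights = damped_weights("buy buy buy buy now".split())
```

Here "buy" occurs four times as often as "now", yet its weight is only a little more than twice as large.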

Along the same lines, it may be interesting to look into the multiplication, rather than addition, of the word counters from various messages. When storing the logarithm, this brings us back to the original addition and subtraction. Note that the original approach was motivated as "look how simple", with no rigour, though of course it has shown its value. We might try another angle too, and see if it also works.

This is a bit vague. It needs more thinking, but the tongue-in-cheek choices made in current Bayesian spam filtering mean that the math is not fully prescriptive.