Idea: Fixed-size storage
Bogofilter stores words with spam counts. A powerful property of this is that it is possible to "add" or "subtract" and email from the set, and to learn that it belongs in another category. This property should definately be preserved.
It is practical to use fixed-size storage, rather than ever-growing space. This can be achieved if, instead of storing the words, we hash the words and use the numeric result to evenly spread the contents at those locations.
Clashes. There will be hashes that cover more than one word. The result of that would be that their results cannot be distinguished. So we might not be able to distinguish the word tea
from rainbow
, for instance. This should not occur very frequently. If we have N words evenly spread into M slots, then N/M is the risk of finding a clash. Tongue-in-cheek: 4000 entries should work well for. We might go a little over or under to find a prime modulus, and we might store a bit of metadata, such as a timestamp for the last update and perhaps a privacy-improving seed for the hash function.
Singletons. This approach is really helpful with words that occur only once in a lifetime. Bogofilter stores many of those, and because it counts words it cannot forget them. Merging such words with other words means that they waste no storage, and also reduce searching times.
Efficiency. Another helpful bit is that a message can be scanned into a fixed vector of the same length as the hash vector. The two then get combined with a vector operation, something that processors can do really well. We might even use a graphical card to filter spam! Also, vector operations might be cleverly reordered.