Words in all capitals use too much storage in Elasticsearch
I noticed that words like ALL_CAPITALS_WORD
produce the following tokens:
"all_capitals_word"
"all"
"ll"
"capitals"
"apitals"
"pitals"
"itals"
"tals"
"als"
"ls"
"word"
"ord"
"rd"
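In #27918 (closed) we agreed that partial words don't need to be searchable. The (?=([\\p{Lu}]+[\\p{L}]+))
pattern happens to create all of these partial-word tokens for all-capitals words: because it is a lookahead, it matches again at every capital letter that is followed by another letter, so each word also yields its suffixes.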
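To make the behaviour concrete, here is a minimal sketch that reproduces the partial-word tokens listed above. It rests on some assumptions: Python's built-in re module does not support \p{Lu} or \p{L}, so ASCII classes stand in for them; the function name capture_tokens is hypothetical; and lowercasing is assumed to happen in a later step. The full "all_capitals_word" token is not produced by this pattern, so it does not appear in the output.

```python
import re

# ASCII stand-in for (?=([\p{Lu}]+[\p{L}]+)). Python's re module has no
# \p{..} classes, so [A-Z] and [A-Za-z] are used here as an approximation.
PARTIAL_WORD_PATTERN = re.compile(r"(?=([A-Z]+[A-Za-z]+))")


def capture_tokens(text):
    """Collect the group-1 capture at every position where the lookahead matches."""
    # The lookahead is zero-width, so the regex engine retries at each
    # following character and emits a shorter suffix every time.
    return [m.group(1).lower() for m in PARTIAL_WORD_PATTERN.finditer(text)]


print(capture_tokens("ALL_CAPITALS_WORD"))
# ['all', 'll', 'capitals', 'apitals', 'pitals', 'itals', 'tals', 'als',
#  'ls', 'word', 'ord', 'rd']
```

The last letter of each word never starts a token because the pattern requires at least one more letter after the capital.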
The ideal tokens for this word should be:
"all_capitals_word"
"all"
"capitals"
"word"
We could likely get large storage savings by improving that pattern. I believe the same pattern is also responsible for splitting up words like AllCapitalsWord,
which is why it splits at every capital letter. It may be difficult to distinguish the two cases.
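One possible direction, sketched here only as an illustration and not as the agreed fix: add a negative lookbehind so that a match can only start where the preceding character is not a capital letter. The names CURRENT, CANDIDATE, and captures are hypothetical, and ASCII classes again stand in for \p{Lu} and \p{L} because Python's re module lacks them. The third example shows why the two cases are hard to tell apart.

```python
import re

# ASCII stand-ins for \p{Lu} and \p{L}, which Python's re module lacks.
CURRENT = re.compile(r"(?=([A-Z]+[A-Za-z]+))")
# Hypothetical candidate: a match may only start where the previous
# character is not a capital letter.
CANDIDATE = re.compile(r"(?<![A-Z])(?=([A-Z]+[A-Za-z]+))")


def captures(pattern, text):
    """Return the group-1 capture for every match position, without lowercasing."""
    return [m.group(1) for m in pattern.finditer(text)]


for word in ("ALL_CAPITALS_WORD", "AllCapitalsWord", "HTMLParser"):
    print(word)
    print("  current:  ", captures(CURRENT, word))
    print("  candidate:", captures(CANDIDATE, word))
```

Under these assumptions the candidate yields only ALL, CAPITALS, and WORD for the all-capitals word and leaves the AllCapitalsWord captures unchanged. The cost is that a word such as HTMLParser no longer yields Parser, since the P follows a capital letter.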
Edited by Dylan Griffith