Consider reducing index size by using a different code analyzer
Summary
There's an effort to reduce the Elasticsearch index size on GitLab.com (see #3327 (closed)). One reason for the huge index size is the excessive number of terms generated by the configured code_analyzer. It currently uses the whitespace tokenizer together with a custom code filter that generates additional terms from the terms emitted by the tokenization.
For example, the simple JavaScript function call console.log('test') emits the following 58 tokens:
co
con
cons
conso
consol
console
console.
console.l
console.lo
console.log
console.log(
console.log('
console.log('t
console.log('te
console.log('tes
console.log('test
console.log('test'
console.log('test')
co
con
cons
conso
consol
console
co
con
cons
conso
consol
console
console.
console.l
console.lo
console.log
console.log(
console.log('
console.log('t
console.log('te
console.log('tes
console.log('test
lo
log
lo
log
log(
log('
log('t
log('te
log('tes
log('test
log('test'
log('test')
te
tes
test
te
tes
test
As you can see here, there are a lot of duplicated terms.
The mapping we're proposing in this issue would only emit the following 11 tokens for the same input, while still supporting queries for console.log or log('test'):
co
con
cons
conso
consol
console
lo
log
te
tes
test
We have used Elasticsearch's Analyze API to compare the tokens produced by the different mappings.
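For reference, a sketch of such a request is shown below; gitlab-test is just a placeholder for any index whose settings define the analyzer in question:
# "gitlab-test" is a hypothetical index name; use any index that defines code_analyzer
GET /gitlab-test/_analyze
{
  "analyzer": "code_analyzer",
  "text": "console.log('test')"
}
The tokens array of the response lists the emitted terms, so running the same request against the current and the proposed settings makes the difference directly visible.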
Improvements
The Why
Fewer terms lead to a smaller Elasticsearch index. A smaller Elasticsearch index is faster, easier to maintain and easier to rebuild. The indexing time should also decrease significantly due to the removal of the code filter, which uses expensive regular expressions to extract sub-terms in order to enable searches within camel-case strings (e.g. to find GitLab when searching for Lab) or strings containing dots (e.g. to find com.gitlab when searching for gitlab). The latter feature can also be supported by using a different tokenizer.
In our tests, the index size decreased by 44% due to this change, and indexing took only half as long as with the current mapping.
The What
- Instead of using the whitespace tokenizer for the code_analyzer and code_search_analyzer, we're proposing to use a pattern tokenizer that splits the input on all non-word characters. As a consequence, no non-word characters will end up in the index.
- Instead of using a custom pattern_capture filter, we use a word_delimiter filter to index separate terms for camel-case and snake-case tokens.
- With this adapted mapping, we need to ensure that a search for console.log does not match documents containing the terms console and log that are not adjacent to each other. Therefore, queries on fields analyzed with the code_analyzer must be whitespace-tokenized by the application, and a Match Phrase query should be performed for each resulting token.
Replace whitespace tokenizer with pattern tokenizer
The proposed mapping declares a custom code_tokenizer that is configured for both the code_analyzer and the code_search_analyzer:
"code_tokenizer": {
"type": "pattern",
"pattern": "\\W+"
}
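The effect of this tokenizer can be checked in isolation with the Analyze API by defining it inline; a minimal sketch, independent of any concrete index:
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\W+"
  },
  "text": "console.log('test')"
}
This should return only the tokens console, log and test, i.e. the non-word characters ., (, ' and ) never reach the index.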
Replace pattern_capture filter with word delimiter filter
Currently the code filter consists of seven patterns:
"code": {
"type": "pattern_capture",
"preserve_original": "true",
"patterns": [
"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
"(\\d+)",
"(?=([\\p{Lu}]+[\\p{L}]+))",
"\"((?:\\\"|[^\"]|\\\")*)\"",
"'((?:\\'|[^']|\\')*)'",
"\\.([^.]+)(?=\\.|\\s|\\Z)",
"\\/?([^\\/]+)(?=\\/|\\b)"
]
}
Let's consider them one by one:
(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)
We suppose this one is for supporting substring searches for camel-case or snake-case tokens. For a term such as "GitLab" or "Git-Lab" emitted by the tokenizer, this pattern would add the terms "Git" and "Lab". The new solution will also need to support this feature.
(\\d+)
This one extracts numbers from terms and indexes them as separate terms. For example, it would add '500' as a separate term for the token 'Indy500'. We doubt that this is really a valuable feature; personally, I would not expect this search result.
(?=([\\p{Lu}]+[\\p{L}]+))
We suppose this is just a special case of the camel-case handling, so that a search for 'CSVParser' matches documents containing a string such as 'customCSVParser'.
\"((?:\\\"|[^\"]|\\\")*)\"
This pattern extracts strings enclosed in double quotes such as "GitLab". As the whitespace tokenizer is currently used, this pattern is useless for inputs such as "GitLab Inc." because the tokenizer already splits this input into '"GitLab' and 'Inc."'. The proposed solution does not index double-quotes and therefore makes this capture pattern obsolete.
'((?:\\'|[^']|\\')*)'
Same as the pattern for double-quotes, just for single-quotes.
\\.([^.]+)(?=\\.|\\s|\\Z)
Extracts terms for a token such as a Java package name. For example, for the token 'com.gitlab.elastic' the terms 'com', 'gitlab' and 'elastic' will be added to the index. With the proposed solution, the tokenizer will already split this string into three separate terms.
\\/?([^\\/]+)(?=\\/|\\b)
This pattern does something similar to the previous one, but for file or URL paths such as '/usr/lib/gitlab'. With the new mapping it will also be the responsibility of the tokenizer to do the splitting.
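Individual patterns of the current code filter can be inspected the same way, by building a transient analyzer in the Analyze API; for example, the path pattern in isolation:
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "pattern_capture",
      "preserve_original": true,
      "patterns": ["\\/?([^\\/]+)(?=\\/|\\b)"]
    }
  ],
  "text": "/usr/lib/gitlab"
}
This should emit the original token /usr/lib/gitlab plus the captured terms usr, lib and gitlab.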
So if we remove the code filter entirely, there are just two features we will need to take care of in a replacement:
- Search in camel-case and snake-case tokens.
- Possibly search for numbers within alphanumeric tokens such as 'Indy500'. The proposed solution currently does not cover this, but it could be extended to do so.
To cover the search in camel-case and snake-case tokens, we've added a Word Delimiter token filter to the code_analyzer as well as the code_search_analyzer:
"camelCase": {
"type": "word_delimiter",
"split_on_case_change": true,
"preserve_original": true
}
Conceptually, this does the same as the code filter that's currently used. The word delimiter filter automatically splits on underscore characters, which is why this solution also works for snake-case tokens. The split_on_case_change option covers the camel-case feature.
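The camel-case and snake-case behaviour can likewise be verified with a transient analyzer that combines the proposed tokenizer and filter (a sketch for illustration only, not the complete proposed analyzer chain):
POST _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\W+"
  },
  "filter": [
    {
      "type": "word_delimiter",
      "split_on_case_change": true,
      "preserve_original": true
    }
  ],
  "text": "GitLab git_lab"
}
This should return GitLab, Git, Lab, git_lab, git and lab, mirroring what the first pattern_capture pattern of the current code filter produces.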
Use match_phrase queries for fields analyzed with code_analyzer
To ensure that a query for console.log test won't match a document containing the terms console, log and test in totally different locations, the queries on fields using the code_analyzer should be changed to match_phrase queries. We suggest splitting the user input by whitespace and then creating a match_phrase query for each resulting token. All match_phrase queries must match the document, to still support queries such as "ClassName methodName" where the ClassName is not adjacent to the methodName in the document.
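As a sketch, a search for "ClassName methodName" could then be translated into a query like the following, where gitlab-test and the field name blob.content are only assumed for illustration:
# index and field name are hypothetical placeholders
GET /gitlab-test/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "blob.content": "ClassName" } },
        { "match_phrase": { "blob.content": "methodName" } }
      ]
    }
  }
}
Each whitespace-separated token of the user input becomes one match_phrase clause, and the bool/must combination requires all of them to match within the same document.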
Risks
Of course, this is a very involved change that can have subtle side-effects. There should be numerous automated tests that verify that the essential search features are working as expected.
Involved components
The Elasticsearch index settings and mapping need to be updated. Further, all queries on fields analyzed with the code_analyzer must be refactored to Match Phrase queries.