
Consider reducing index size by using a different code analyzer

Summary

There's an effort to reduce the Elasticsearch index size (see #3327 (closed)) on gitlab.com. One reason for the huge index size is the excessive number of terms generated by the configured code_analyzer. It currently uses the whitespace tokenizer together with a custom code filter that generates additional terms from those emitted by the tokenizer.

For example, the current analyzer emits the following 58 tokens for the simple JavaScript function call console.log('test'):

co
con
cons
conso
consol
console
console.
console.l
console.lo
console.log
console.log(
console.log('
console.log('t
console.log('te
console.log('tes
console.log('test
console.log('test'
console.log('test')
co
con
cons
conso
consol
console
co
con
cons
conso
consol
console
console.
console.l
console.lo
console.log
console.log(
console.log('
console.log('t
console.log('te
console.log('tes
console.log('test
lo
log
lo
log
log(
log('
log('t
log('te
log('tes
log('test
log('test'
log('test')
te
tes
test
te
tes
test

As you can see here, there are a lot of duplicated terms.

The mapping we're proposing in this issue would only emit the following 11 tokens for the same input while still supporting queries for console.log or log('test'):

co
con
cons
conso
consol
console
lo
log
te
tes
test

We have used Elasticsearch's Analyze API to compare the tokens produced by the different mappings.
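
A comparison like this can be reproduced with a request along the following lines, run once against an index created with the current settings and once against an index created with the proposed settings (the index name my-index is just a placeholder):

GET /my-index/_analyze
{
  "analyzer": "code_analyzer",
  "text": "console.log('test')"
}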

Improvements

The Why

Fewer terms lead to a smaller Elasticsearch index. A smaller Elasticsearch index is faster, easier to maintain and easier to rebuild. Also, the indexing time should decrease significantly due to the removal of the code filter, which uses expensive regular expressions to extract sub-terms that enable searches within camel-case strings (e.g. to find GitLab when searching for Lab) or strings containing dots (e.g. to find com.gitlab when searching for gitlab). The latter feature can also be supported by using a different tokenizer.

In our tests, the index size decreased by 44% due to this change, and indexing took half as long as with the current mapping.

The What

  1. Instead of using the whitespace tokenizer for the code_analyzer and code_search_analyzer, we're proposing to use a pattern tokenizer that splits the input on all non-word characters. As a consequence, no non-word characters will end up in the index.
  2. Instead of using a custom pattern_capture filter, we use a word_delimiter filter to index separate terms for camel-case and snake-case tokens.
  3. With this adapted mapping, we need to ensure that a search for console.log does not match documents in which the terms console and log occur but are not adjacent to each other. Therefore, queries on fields analyzed with the code_analyzer must be whitespace-tokenized by the application, and a Match Phrase query should be performed for each resulting token.

Replace whitespace tokenizer with pattern tokenizer

The proposed mapping declares a custom code_tokenizer that is configured for the code_analyzer and the code_search_analyzer:

"code_tokenizer": {
    "type": "pattern",
    "pattern": "\\W+"
}
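
The effect of this tokenizer can be checked in isolation with the Analyze API by passing the tokenizer definition inline; for our example input it emits just the three tokens console, log and test:

GET _analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": "\\W+"
  },
  "text": "console.log('test')"
}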

Replace pattern_capture filter with word delimiter filter

Currently the code filter consists of seven patterns:

"code": {                                 
  "type": "pattern_capture",              
  "preserve_original": "true",            
  "patterns": [                           
    "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
    "(\\d+)",                             
    "(?=([\\p{Lu}]+[\\p{L}]+))",          
    "\"((?:\\\"|[^\"]|\\\")*)\"",         
    "'((?:\\'|[^']|\\')*)'",              
    "\\.([^.]+)(?=\\.|\\s|\\Z)",          
    "\\/?([^\\/]+)(?=\\/|\\b)"            
  ]                                       
}                                        

Let's consider them one by one:

(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)

We suppose this one is for supporting substring searches within camel-case or snake-case tokens. This pattern would add the terms "Git" and "Lab" for a token such as "GitLab" or "Git-Lab" emitted by the tokenizer. The new solution will also need to support this feature.

(\\d+)

This one extracts numbers from terms and indexes them as separate terms. For example, it would add '500' as a separate term for the token 'Indy500'. We doubt that this is really a valuable feature; personally, I would not expect this search result.

(?=([\\p{Lu}]+[\\p{L}]+))

We suppose this is just a special case for the camel-case search to support searches for 'CSVParser' to match documents containing a string such as 'customCSVParser'.

\"((?:\\\"|[^\"]|\\\")*)\"

This pattern extracts strings enclosed in double quotes such as "GitLab". As the whitespace tokenizer is currently used, this pattern is useless for inputs such as "GitLab Inc." because the tokenizer already splits this input into '"GitLab' and 'Inc."'. The proposed solution does not index double-quotes and therefore makes this capture pattern obsolete.

'((?:\\'|[^']|\\')*)'

Same as the pattern for double-quotes, just for single-quotes.

\\.([^.]+)(?=\\.|\\s|\\Z)

Extracts terms for a token such as a Java package name. For example, for the token 'com.gitlab.elastic' the terms 'com', 'gitlab' and 'elastic' will be added to the index. With the proposed solution, the tokenizer will already split this string into three separate terms.

\\/?([^\\/]+)(?=\\/|\\b)

This pattern does something similar to the previous pattern, just for file or URL paths such as '/usr/lib/gitlab'. With the new mapping, it will also be the responsibility of the tokenizer to do the splitting.

So if we remove the code filter entirely, there are just two features we will need to take care of in a replacement:

  • Search in camel-case and snake-case tokens.
  • Possibly, search for numbers within alphanumeric tokens such as 'Indy500'. The proposed solution currently does not cover this, but it could be extended to do so (see the sketch below).
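
If this turns out to be wanted after all, the word_delimiter filter proposed in the next section could presumably be extended with its split_on_numerics option. A sketch, not part of the current proposal:

"camelCase": {
  "type": "word_delimiter",
  "split_on_case_change": true,
  "split_on_numerics": true,
  "preserve_original": true
}

With split_on_numerics enabled, a token such as 'Indy500' would also yield the separate terms 'Indy' and '500'.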

To cover the search in camel-case and snake-case tokens, we've added a Word Delimiter token filter to the code_analyzer as well as the code_search_analyzer:

"camelCase": {                 
  "type": "word_delimiter",    
  "split_on_case_change": true,
  "preserve_original": true    
}

Conceptually, this does the same as the code filter that's currently used. The word delimiter filter automatically splits on underscore characters, which is why this solution also works for snake-case tokens. The split_on_case_change option supports the camel-case feature.
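
Putting the pieces together, the analysis section of the proposed settings could look roughly like the sketch below. The filter names and the edge n-gram parameters are illustrative (proposed_mapping.json is authoritative); judging from the prefix terms (co, con, cons, …) in the token lists above, an edge n-gram filter is part of the mapping, and we assume it is only applied in the index-time code_analyzer, not in the code_search_analyzer:

"analysis": {
  "tokenizer": {
    "code_tokenizer": {
      "type": "pattern",
      "pattern": "\\W+"
    }
  },
  "filter": {
    "camelCase": {
      "type": "word_delimiter",
      "split_on_case_change": true,
      "preserve_original": true
    },
    "code_edge_ngram": {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 40
    }
  },
  "analyzer": {
    "code_analyzer": {
      "type": "custom",
      "tokenizer": "code_tokenizer",
      "filter": ["camelCase", "lowercase", "code_edge_ngram"]
    },
    "code_search_analyzer": {
      "type": "custom",
      "tokenizer": "code_tokenizer",
      "filter": ["camelCase", "lowercase"]
    }
  }
}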

Use match_phrase queries for fields analyzed with code_analyzer

To ensure that a query for console.log test won't match a document containing the terms console, log and test in totally different locations, the queries on fields using the code_analyzer should be changed to match_phrase queries. We suggest splitting the user input by whitespace and then creating a match_phrase query for each resulting token. All match_phrase queries must match the document; this still supports queries such as "ClassName methodName" where ClassName is not adjacent to methodName in the document.
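
As a sketch, a search for console.log test would then be translated by the application into something like the following request (the index name and the field name content are placeholders; the actual field depends on the mapping):

GET /my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "content": "console.log" } },
        { "match_phrase": { "content": "test" } }
      ]
    }
  }
}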

Risks

Of course, this is a very involved change that can have subtle side-effects. There should be numerous automated tests that verify that the essential search features are working as expected.

Involved components

The Elasticsearch index settings and mapping need to be updated:

current_mapping.json

proposed_mapping.json

Further, all queries on fields analyzed with the code_analyzer must be refactored to Match Phrase queries.
