Remove partial word matching from code search and require users to explicitly request a prefix search if desired
Summary
Currently, code search uses ngrams so that prefixes match as well as full words. This takes up a lot of storage and can be replaced with prefix search. GitLab already supports prefix search by adding a * at the end of the search term, and this is already documented, so we assume users will use it when that is their intention.
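The behaviour change can be modelled in a few lines of Python. This is a simplified sketch, not GitLab's query code: the matches helper and the sample tokens are hypothetical, and real matching is done by Elasticsearch against analysed tokens.

```python
def matches(query: str, indexed_tokens: set[str]) -> bool:
    """Return True if the query matches any indexed token.

    A trailing '*' requests a prefix match; otherwise only a whole-token
    match counts. Simplified model for illustration only.
    """
    if query.endswith("*"):
        prefix = query[:-1]
        return any(token.startswith(prefix) for token in indexed_tokens)
    return query in indexed_tokens


# Hypothetical tokens indexed for a line of code, with no ngrams stored.
tokens = {"def", "update_settings", "project"}

print(matches("update_settings", tokens))  # whole word
print(matches("update_set", tokens))       # partial word, no wildcard
print(matches("update_set*", tokens))      # explicit prefix search
```

With ngrams removed from the index, only the first and third queries match in this model; the second needs the trailing * to find anything.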
Read more detail at #27918 (comment 323128495)
Improvements
Remove edgeNGram_filter from the Elasticsearch mapping. This saves considerable storage, since many partial word tokens are stored today. It also fixes #28419 (closed).
The mapping used for our tests: replace_ngrams_with_index_prefixes.json
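To give a rough sense of where the storage goes, the sketch below counts the terms an edge ngram filter would emit for a few sample tokens. The min_gram and max_gram values and the sample tokens are assumptions chosen for illustration, not the settings from the real mapping.

```python
def edge_ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Illustrative values only; the real filter settings may differ.
MIN_GRAM, MAX_GRAM = 2, 40

for token in ["def", "project", "update_settings"]:
    grams = edge_ngrams(token, MIN_GRAM, MAX_GRAM)
    print(f"{token}: 1 term without ngrams, {len(grams)} terms with them")
```

Under these assumptions the number of stored terms grows with the length of each token, which is why long identifiers in code are costly to index this way.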
Risks
Users may be surprised that their partial word searches no longer match. We accept this risk, since the partial matching appears to have been accidental anyway and we have always documented that partial word searches should be done by adding * at the end.
We decided not to add the * automatically to every search, since there is no way to know whether the user wants this, and it adds a performance cost to every query when most people won't be trying to do a prefix search.
If prefix searches are common, users may notice that they are slower than before, when all partial matches were already tokenised. This can be improved by adding index_prefixes to speed things up (which still uses less storage than edgeNGram_filter), but we would want to see data showing that prefix searches are common and slow before adding this.
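As a rough sketch of that fallback, the fragment below shows what an index_prefixes setting on a text field could look like, written as a Python dict, together with a simplified model of the extra terms it would store. The field name content is hypothetical, min_chars and max_chars are shown at Elasticsearch's documented defaults (2 and 5), and the indexed_prefixes helper is only an approximation of what Elasticsearch indexes internally.

```python
import json

# Hypothetical mapping fragment; "content" is an assumed field name.
mapping_fragment = {
    "properties": {
        "content": {
            "type": "text",
            "index_prefixes": {"min_chars": 2, "max_chars": 5},
        }
    }
}


def indexed_prefixes(token: str, min_chars: int, max_chars: int) -> list[str]:
    """Approximate the prefixes stored for one token under index_prefixes."""
    longest = min(len(token), max_chars)
    return [token[:n] for n in range(min_chars, longest + 1)]


print(json.dumps(mapping_fragment, indent=2))

opts = mapping_fragment["properties"]["content"]["index_prefixes"]
for token in ["project", "update_settings"]:
    extra = indexed_prefixes(token, opts["min_chars"], opts["max_chars"])
    print(f"{token}: {len(extra)} extra prefix terms {extra}")
```

Because max_chars caps the prefix length, the number of extra terms per token is bounded in this model, whereas an edge ngram filter with a large max_gram stores one term for almost every leading substring.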