Exact match (substring match) code searches in Advanced Search
Solution: Zoekt
UPDATE 2023-05-13: We are working on replacing our code search engine with Zoekt in &9404 as a solution to this problem. You can follow the progress in that epic.
Problem
The following things are not possible today with code search and they are likely things that users want to do:
- Searching for samples of code including special characters (some special characters work some of the time but it's mostly only alphanumeric characters that are searchable)
- You cannot search part of a word, but only whole words. We do support wildcards at the end of a word but it would not be possible to search for a word from the middle
- We cannot match multiple lines (or text that appears only in a single line) in a file because Elasticsearch doesn't have a concept of "lines" and is actually matching multiple fields so your terms may span multiple lines (or appear in the filename for example). For the same reason we can't even reliably highlight the matching line and our line matching today is just a guess and often unhelpful
All these problems generally stem from the fact that Elasticsearch is designed for normal english text and normal english text search engines and not for code. Users are actually expecting code search to be more like a "grep" experience or the "find" function in their IDE. These almost all behave, by default, as an "exact substring match".
Technical problem
Today our advanced code search is powered by Elasticsearch. The way it works is that, during indexing time, we are extracting a set of "tokens" from the text of the file being indexed using a series of regexes and then later when you search Elasticsearch turns your search term into tokens and tries to match those tokens. This design choice in Elasticsearch effectively means we need to know ahead of time what kinds of searches users will be performing before they perform them. We've attempted to do this a lot with the iterations we took in &3621 (closed) but it's not perfect and we'll forever be chasing the long tail of patterns that a user may wish to search for and even then it will only solve problem (1) above so we need to take a drastically different approach.
Solution
We need to populate a search index that can efficiently do partial substring matches. We cannot use any out of the box features in Elasticsearch to do this so we'll need to build the indexing/searching logic ourselves on top of Elasticsearch OR we'll need to use a different technology for the code index.
Technical details in implementation on Elasticsearch
I described in #4175 (closed) a solution for regexes. The same technique will apply for substring matching (which is just a subset of regexes that don't use any regex features).
Efficient searching in Elasticsearch requires searching tokens in the inverted index. In order to make the inverted index contain all the characters necessary for exact matching we will need to tokenize every 3 characters to create an "trigram index".
This is similar to the approach taken by Google's internal code search tool and will give all matches for the regex with fairly good performance and fairly low storage usage. It should be possible to just make a regex that captures every 3 characters to make the trigram index.
The zoekt architecture docs show a clever optimization that looks for trigrams at the correct distance apart. Using the trigram index approach above as well as the positions
index_options we will probably have the necessary data in our index to do this. I didn't find any features in Elasticsearch to do this style of searching but we might be able to implement something like this using a plugin.
Competitive analysis
- GitHub originally did a similar thing to GitLab as they are using Elasticsearch and they list a similar list of limitations at https://docs.github.com/en/github/searching-for-information-on-github/searching-code#considerations-for-code-search . It's also a very popular question in Stack Overflow because many people expect code search to behave this way. It seems like at some point GitHub actually started building a specific "exact match" feature but it was sunset after a beta . They eventually took some time off and realised they needed to re-write everything from scratch https://github.blog/2021-12-08-improving-github-code-search/ https://news.ycombinator.com/item?id=29487237
- Sourcegraph actually supports this and has become quite popular recently and some people are asking us why our search doesn't do this. Their index is based on a fork of zoekt. We could consider basing our search on zoekt somehow as well.
- I'm not exactly sure how OpenGrok works but I believe it supports exact substring matching
- Hound supports exact substring matching as well and it is similarly based on a trigram index