Investigate PostgreSQL full text search as an alternative to trigram indexes
Searching through issues or MRs can get slow very quickly, especially when searching for multiple terms (e.g. "100 SQL"). The main reason is that PostgreSQL simply has to search through giant trigram indexes to find the right data. Further, for every search word we add an AND
for every field we search in (e.g. "100 SQL" becomes something along the lines of WHERE title ILIKE '%100%' OR description ILIKE '%100%'
).
While GitLab EE offers ElasticSearch support it doesn't seem to be much of a priority to get this into a shape where it can run on GitLab.com. Further it's an EE only feature. If we were to use full text search in PostgreSQL it would be a PostgreSQL only feature, we're not going to bother with also using some kind of MySQL full text search system (instead we'll just continue to use ILIKE/LIKE for MySQL).
The downside of this is that we basically have 3 code paths when it comes to searching data:
- For PostgreSQL we'd use full text search
- For MySQL we'd use (I)LIKE
- For EE we'd use ElasticSearch if enabled
This means additional complexity. We'd also have to set up the tsvector indexes in a way similar to the trigram indexes so they are only created when using PostgreSQL.
Other things to consider:
- What language would we create the text search indexes for? Since content can be in any language I don't think we can settle on a specific language
- How are we going to store the tsvector data? From using text search in the past I recall we stored this in a column then indexed that, but I'm not sure if that's (still) necessary
Impact
Based on the above and the comments below the impact of this would be as follows:
-
👎 Fuzzy searching wouldn't really work, though the degree of this depends on how the text search vectors are constructed -
👍 we would be able to support search operations such asfoo & bar
andfoo | bar
, which would allow users to construct more powerful search queries -
👍 it appears to be much faster in my simple test cases -
👍 the text vector indexes are much smaller than the trigram indexes (about 8-10 times smaller, though this may vary) -
👎 more complexity as we now have to search differently based on what database you're using -
👎 dealing with content in different languages becomes more complicated. A language detector could solve this but that's an entirely different beast. From my experience with NLP projects most are an absolute nightmare to run in production. https://github.com/opener-project/language-identifier is pretty decent and not too slow, but it requires JRuby.