Cloogle / cloogle-web / Issues / #246

Improve name search by adding stemmed documentation to the index

Name search is currently useful mostly if you already know, or remember, (part of) the name of the function you are looking for. In other cases, unification queries are often more helpful, but they are harder to formulate correctly. `using` queries can help in certain cases, but not when you are looking for things that use common types/classes/...

To improve name search we can start indexing documentation as well. This means we need:

A stemmer (https://clean-lang.org/pkg/snowball/)
Functionality to split text into words. This needs to take camelCase and snake_case into account, but note that queries are often formulated in all-lowercase, e.g. viewinformation. So ideally viewInformation would be indexed under view, information, and viewinformation
Ignore NLTK stopwords
~~Possibly also ignore "code-specific stopwords" like x, y, n, xs, ys, ...~~ can be done later; no indication that it is needed
When searching, look up words from the query instead of ngrams
Also include stemmed words from the documentation in the index to search on, but with a lower priority than matches in the name itself (this lower priority factor may be hardcoded for simplicity, or it may go through Cloogle.Search.Rank)
Matches on frequent words should carry less weight than matches on infrequent words (Tf-idf)
Add ability to strip license headers from module documentation, otherwise they get indexed
Allow searches with spaces
Need to retain support for n-gram search for funny symbol names like </>, but perhaps with n=2
Maybe this makes it possible to index the common-problems repository in the backend instead of the frontend
Can we get rid of regexes in syntaxSearch? (Regex computations take up 10% of time and 40% of allocated memory according to the callgraph profiler.)
Check diffs for TODOs
~~Add rank setting constraints for queries with both name and unify; otherwise the name weight and the other weights are not related in any way~~ based on tests this seems to be fine
Optimize, e.g. log10 (toReal global_count) is recomputed all the time while we can just store that information directly
Check performance (about 25% slower; this is acceptable)
Replace removeContainedEntries with combineContainedEntries: right now the SPDX License type ranks higher than enterInformation for the query information because the Japan_Network_Information_Center_License constructor result scores very high, and is contained in License. combineContainedEntries should not simply remove the contained entry and give the containing entry the highest score. Instead it should give the containing entry a score like own score + contained score / nr of contained entries, and the contained entry should only be removed if the final score of the containing entry is higher.
~~There are conflicts with type X, class X, instance X, exact X, and using X queries~~ then people just have to use X type instead
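
The word-splitting item above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in Clean, and `index_keys` is a hypothetical helper, not an existing Cloogle function. It assumes identifiers consist of ASCII letters, digits and underscores, and it does no stemming or stopword removal.

```python
import re
from typing import List

# Hypothetical helper, not part of Cloogle.
# A word is: a run of capitals not followed by a lowercase letter (acronym),
# an optionally capitalised lowercase run, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def index_keys(identifier: str) -> List[str]:
    """Return the lowercase keys under which an identifier could be indexed."""
    parts = [m.group(0).lower() for m in _WORD.finditer(identifier)]
    keys = list(parts)
    joined = "".join(parts)
    # Also index the all-lowercase concatenation, since queries are often
    # typed without capitals or underscores.
    if len(parts) > 1 and joined not in keys:
        keys.append(joined)
    return keys

print(index_keys("viewInformation"))   # ['view', 'information', 'viewinformation']
print(index_keys("snake_case_name"))   # ['snake', 'case', 'name', 'snakecasename']
```

A symbol name such as `</>` yields no keys in this sketch, which is why n-gram search would still be needed for such names.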
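
The scoring rule proposed for combineContainedEntries can be read in more than one way. The sketch below, in Python rather than Clean, shows one possible reading: the containing entry gets its own score plus the sum of the matching contained scores divided by the number of all contained entries. It assumes that a higher score is better. The `Entry` type, the scores and the constructor count are made up for illustration and are not taken from Cloogle.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Entry:
    name: str
    score: float                       # assumption: higher is better
    contained: List["Entry"] = field(default_factory=list)  # matching contained entries
    total_contained: int = 0           # all contained entries, matching or not

def combine_contained_entries(entries: List[Entry]) -> List[Tuple[str, float]]:
    """Rank entries, folding contained scores into their containers."""
    results = []
    for e in entries:
        bonus = 0.0
        if e.contained:
            bonus = sum(c.score for c in e.contained) / max(e.total_contained, 1)
        final = e.score + bonus
        results.append((e.name, final))
        # A contained entry is dropped only if its container ends up higher.
        results.extend((c.name, c.score) for c in e.contained if c.score >= final)
    return sorted(results, key=lambda r: r[1], reverse=True)

# Illustrative numbers only: one matching constructor out of 500.
license_type = Entry(
    "License", 0.1,
    [Entry("Japan_Network_Information_Center_License", 0.9)], 500)
for name, score in combine_contained_entries(
        [license_type, Entry("enterInformation", 0.8)]):
    print(name, round(score, 4))
```

Under this reading, License no longer outranks enterInformation for the query information, while the matching constructor stays in the results as its own entry.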

Edited Nov 04, 2022 by Camil Staps
