Improve name search by adding stemmed documentation to the index
Name search is currently mostly useful if you already know, or remember, (part of) the name of the function you are looking for. In other cases, unification queries are often more helpful, but these are more difficult to formulate correctly. `using` queries can help in certain cases, but not when you are looking for things that use common types/classes/...
To improve name search we can start indexing documentation as well. This means we need:
- A stemmer (https://clean-lang.orgi/pkg/snowball/)
- Functionality to split text on words. Need to take camelCase and snake_case into account, but note that queries are often formulated in all-lowercase, e.g. `viewinformation`. So best would be for `viewInformation` to be indexed under `view`, `information`, and `viewinformation`
- Ignore NLTK stopwords
- Possibly also ignore "code-specific stopwords" like `x`, `y`, `n`, `xs`, `ys`, ... (can be done later; no indication that it is needed)
- When searching, look up words from the query instead of ngrams
- Also include stemmed words from the documentation in the index to search on, but with a lower priority than matches in the name itself (this lower priority factor may be hardcoded for simplicity, or it may go through `Cloogle.Search.Rank`)
- Matches on frequent words should carry less weight than matches on infrequent words (Tf-idf)
- Add ability to strip license headers from module documentation, otherwise they get indexed
- Allow searches with spaces
- Need to retain support for n-gram search for funny symbol names like `</>`, but perhaps with n=2
- Maybe this makes it possible to index the common-problems repository in the backend instead of the frontend
- Can we get rid of regexes in `syntaxSearch`? (regex computations take up 10% of time and 40% of allocated memory according to the callgraph profiler)
- Check diffs for TODOs
- Add rank setting constraints for queries with both `name` and `unify`; otherwise the name weight and the other weights are not related in any way (based on tests this seems to be fine)
- Optimize, e.g. `log10 (toReal global_count)` is recomputed all the time while we can just store that information directly
- Check performance (±25% slower, this is acceptable)
- Replace `removeContainedEntries` with `combineContainedEntries`: right now the SPDX `License` type ranks higher than `enterInformation` for the query `information` because the `Japan_Network_Information_Center_License` constructor result scores very high, and is contained in `License`. `combineContainedEntries` should not simply remove the contained entry and give the containing entry the highest score. Instead it should give the containing entry a score like `own score + contained score / nr of contained entries`, and the contained entry should only be removed if the final score of the containing entry is higher.
- There are conflicts with `type X`, `class X`, `instance X`, `exact X`, and `using X` queries (then people just have to use `X type` instead)
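
The word-splitting item above could look roughly like the following Python sketch. It is only an illustration of the intended behaviour, not Cloogle code: the name `index_terms` and the regular expression are assumptions, and stemming and stopword removal are left out.

```python
import re

# Hypothetical helper, not part of Cloogle.
# Matches, in order: an acronym run (HTTP in HTTPServer), a capitalised or
# lowercase word, or a run of digits. Underscores and symbols are skipped.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def index_terms(identifier):
    """Return the lowercase parts of an identifier and, if there is more
    than one part, their all-lowercase concatenation as an extra term."""
    parts = [m.group(0).lower() for m in _WORD.finditer(identifier)]
    terms = list(parts)
    if len(parts) > 1:
        # Queries are often typed in all-lowercase, e.g. viewinformation.
        terms.append("".join(parts))
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(terms))
```

With this sketch, `index_terms("viewInformation")` yields `view`, `information`, and `viewinformation`. Symbol names such as `</>` yield no terms at all, which is why n-gram search still has to be retained for them.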
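
The tf-idf item and the `log10` optimisation can be combined by computing the inverse document frequency once per word when the index is built. The sketch below is a simplified Python illustration under assumptions: `build_idf`, `match_weight`, and the factor `0.5` for documentation matches are hypothetical and do not reflect the actual settings in `Cloogle.Search.Rank`.

```python
import math

def build_idf(doc_freq, total_docs):
    """Precompute log10(total_docs / df) for every indexed word.

    doc_freq maps a word to the number of entries it occurs in.
    The logarithm of the total is computed once, not per lookup.
    """
    log_total = math.log10(total_docs)
    return {word: log_total - math.log10(df) for word, df in doc_freq.items()}

def match_weight(word, idf, in_name, doc_factor=0.5):
    """Weight of a single word match.

    Frequent words get a low idf and therefore a low weight. Matches that
    come from documentation rather than the name are scaled down by
    doc_factor, which is an assumed, hardcoded value here.
    """
    base = idf.get(word, 0.0)  # unknown words contribute nothing
    return base if in_name else base * doc_factor
```

For example, in an index of 100 entries a word occurring in all of them gets weight 0, while a word occurring in a single entry gets weight 2 for a name match and 1 for a documentation match.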
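
One possible reading of the `combineContainedEntries` scoring rule is sketched below in Python. It assumes that higher scores rank better and that "contained score" means the sum of the scores of all contained entries; both are interpretations of the description above, and the function name and signature are hypothetical.

```python
def combine_contained_entries(own_score, contained_scores):
    """Return (combined score of the containing entry, scores of the
    contained entries that stay in the result list).

    Assumes higher is better. A contained entry is only removed when the
    containing entry ends up with a strictly higher score.
    """
    if not contained_scores:
        return own_score, []
    combined = own_score + sum(contained_scores) / len(contained_scores)
    kept = [s for s in contained_scores if s >= combined]
    return combined, kept
```

With an own score of 1.0 and contained scores 9.0, 0.0, and 0.0, the containing entry ends at 4.0 and the entry scoring 9.0 stays in the results. Under this reading, a single high-scoring constructor would no longer lift its whole type above more relevant results.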
Edited by Camil Staps