If there are many matches for the substring you are searching for in the group autocomplete dropdown, you cannot find the specific group, and there is then no way to enable Elasticsearch for that group.
In general I think this bug happens across many auto complete dropdowns in GitLab. We should probably put exact matches at the top of the list to ensure it's always possible to find anything.
@DylanGriffith that looks like a backend bug. We use ILIKE without any ordering there:
[40] pry(main)> Namespace.search('abc123')
Namespace Load (0.7ms) SELECT "namespaces".* FROM "namespaces" WHERE ("namespaces"."name" ILIKE '%abc123%' OR "namespaces"."path" ILIKE '%abc123%')
Also, we could think about allowing searching by id. Currently it's almost impossible to pick one specific project if we have a lot of similar/identical names.
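I'm wondering if we could change the strategy here to use POSITION('abc123' in "namespaces"."path") to have a data point to order by (the index of the substring), so that a match at the very start of the path would always yield the lowest position and thus be at the top. Note that POSITION() is 1-based in PostgreSQL: a match at the start yields 1, and 0 means no match.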
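To make the ordering idea concrete, here is a minimal in-memory sketch in plain Ruby. It is not the actual `Namespace.search` implementation: `order_by_position` is a hypothetical helper, the sample paths are made up, and `String#index` stands in for SQL's `POSITION()`. The length and alphabetical tie-breakers are also assumptions added so the output is deterministic.

```ruby
# Hypothetical stand-in for ORDER BY POSITION(query IN path).
# String#index is 0-based and returns nil when there is no match,
# whereas PostgreSQL's POSITION() is 1-based and returns 0 for no match.
def order_by_position(paths, query)
  needle = query.downcase
  paths
    .map { |path| [path, path.downcase.index(needle)] } # case-insensitive, like ILIKE
    .reject { |_, idx| idx.nil? }                       # drop non-matching rows
    .sort_by { |path, idx| [idx, path.length, path] }   # earliest match first
    .map(&:first)
end

paths = %w[my-abc123-fork abc123 abc123-docs xabc123 unrelated]
puts order_by_position(paths, 'abc123')
```

With this ordering, the path that equals the query sorts first because it matches at the start and is the shortest candidate.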
I did a bit of research and it seems this approach would be valid if we can leverage GIN indexes and still do case insensitive search.
Looking at the structure.sql, it seems our GIN indexes are not built using lower(), which means they are case-sensitive. We circumvent that by using ILIKE in the search query.
After some testing, it seems POSITION() never ends up using any index, so it runs as a sequential scan, which would slow down the query a lot.
I'm looking at other solutions.
Do a UNION: search for <query>% first, then %<query>%, remove any duplicates, and order the first batch before the second.
Rank by coverage, which I would define as (len(column) - len(query)) / len(column). A lower number would mean that there are fewer outstanding characters in the result.
I'm looking at how we could add that logic to our fuzzy_search algorithm, as it uses multiple columns to do the search. I think a simple min(...) of each column's coverage would do just fine for ranking purposes.
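The coverage ranking can be sketched in plain Ruby as follows. This is only an illustration under assumptions: `coverage` and `rank` are hypothetical helper names, records are plain hashes rather than ActiveRecord models, and every listed column is assumed to already contain the query.

```ruby
# coverage = (len(column) - len(query)) / len(column)
# 0.0 means the column is exactly as long as the query; values near 1.0
# mean the query covers only a small part of the column.
def coverage(value, query)
  return 1.0 if value.nil? || value.empty? # treat empty columns as the worst rank
  (value.length - query.length).to_f / value.length
end

# Rank a record by the best (smallest) coverage among the searched columns.
def rank(record, columns, query)
  columns.map { |column| coverage(record[column], query) }.min
end

records = [
  { name: 'ABC123 Tools', path: 'abc123-tools' },
  { name: 'abc123',       path: 'abc123' }
]

sorted = records.sort_by { |record| rank(record, %i[name path], 'abc123') }
puts sorted.map { |record| record[:path] }
```

Sorting ascending by `rank` puts the record whose columns are closest in length to the query first.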
I'm pretty sure we'll want to put that under a feature flag.
@mbergeron that's a good option. What do you think about adding to that query something like ORDER BY title = 'war' DESC, ... to be sure that even if we have a lot of results with the same length an exact match would always be on top?
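As a rough illustration of that tie-breaker, the sketch below sorts in-memory titles so that an exact match wins even among results of the same length. `order_results` is a hypothetical helper and the titles are made-up data. The comparison is case-sensitive to mirror SQL's `=`, and the length and alphabetical keys are assumptions standing in for whatever ranking follows in the `ORDER BY`.

```ruby
# Mirrors: ORDER BY title = 'war' DESC, <other ranking columns>
def order_results(titles, query)
  titles.sort_by do |title|
    exact = title == query ? 0 : 1 # exact matches sort first
    [exact, title.length, title]   # then shorter titles, then alphabetical
  end
end

titles = %w[War WAR warp war]
puts order_results(titles, 'war')
```

Here 'War', 'WAR' and 'war' all have the same length, so only the exact-match key decides which of them comes first.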
@dgruzd the problem I'm facing right now is that we can't control the searched fields from the API, which are:
Namespace: name, path
Project: name, path, description or routes.name, routes.path, description
That can lead to weird scenarios where a match in the description will always yield a low ranking, because of the size of the searched text.
Another problem I'm facing is that the API exposes a sorting mechanism to the consumer, so the ranking order gets overridden (we could probably add a :rank/:match sort option to circumvent this).
I'm wondering if we could supply hints to the API to restrict the search (IMO we only want to search by path_with_namespace here) and go from there?
Bummer, I'm afraid this approach cannot work when there are multiple search columns: we are matching with OR so as soon as there is a valid hit on any column, the rank is computed for all the columns, even if the text doesn't match in it.
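The pitfall can be reproduced with a small in-memory sketch. Everything here is hypothetical: `naive_rank` and the sample projects are made up, and the length-based coverage formula is the one discussed earlier in the thread. Because the rank looks only at lengths, a short column that does not contain the query at all can still produce the best score.

```ruby
# Length-only coverage, computed for every column regardless of whether
# that column actually contains the query (this is the flaw).
def naive_rank(record, columns, query)
  columns.map { |column|
    value = record[column].to_s
    value.empty? ? 1.0 : (value.length - query.length).to_f / value.length
  }.min
end

query = 'war'
projects = [
  { name: 'warehouse', description: 'Inventory service' },              # matches on name
  { name: 'tools',     description: 'Utilities for the warehouse team' } # matches on description only
]

projects.each do |project|
  puts format('%-10s %.3f', project[:name], naive_rank(project, %i[name description], query))
end
```

The 'tools' project only matches through its description, yet its short, non-matching name gives it a lower (better) rank than 'warehouse'.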
@dgruzd @DylanGriffith This problem looks like a good rabbit hole. I'm starting to think that we should simply implement the pagination logic in the front-end and be done with this.
I'll push my current code so you can take a look at the implementation, but I don't think we'll be able to use it.
The search results are actually good (too large even), but the problem is that we only show the top 11.
@mbergeron what do you think about adding an ability to search by id? I think it could help a lot with this specific problem and it might be useful in other parts of our application.
@dgruzd I think this is where we'll end up, but I'd go with something more legible, and that matches what the dropdown shows.
I had a talk with @DylanGriffith yesterday about that, and improving the API might be a better way to go.
With that said, I took a look at the routes table, and this seems like the best candidate for this use-case, as it contains a collection of unique paths for each type of entity.
I think I'll simply change the front-end to use the AutocompleteController instead of the API.
In general I think this bug happens across many auto complete dropdowns in GitLab. We should probably put exact matches at the top of the list to ensure it's always possible to find anything.
This is something I've also experienced whenever you want to mention another user for instance.
@changzhengliu I have a question regarding the Namespace field in there. It seems to me this should be a Group field, as I don't see any use-case of using a User's namespace, as there is no permission model on it.
@mbergeron I'm not sure I fully understand. I don't see any reason we can't index a user's personal namespace. This whitelist isn't really about the group as much as it is about the projects within the group. Basically this whitelist means that all projects within this group will be indexed in Elasticsearch. As such they'll get all the search features available when the project is indexed in Elasticsearch (advanced search syntax etc.).
What you may be pointing out, legitimately, though, is that there is no way in GitLab's UI to search in a user's namespace and can only scope searches to groups or projects. That is true but if we enable indexing for all of a user's projects they could still do project scoped advanced search in all of their projects.
Basically I don't think there is much of a good reason to limit this to groups since it does work fine for any kind of namespace. With that said I'm not totally clear what you meant by "there is no permission model on it".
What you may be pointing out, legitimately, though, is that there is no way in GitLab's UI to search in a user's namespace and can only scope searches to groups or projects. That is true but if we enable indexing for all of a user's projects they could still do project scoped advanced search in all of their projects.
Exactly, that means we have to gate this endpoint for admin only and that limits the re-usability of it in order to prevent leaking all of the usernames.
I think I understand your point. I'm trying to get an idea of how this feature is used. I felt like this was mainly a GitLab.com operational filter that lets us roll out the usage of this, and as such, I think that whitelisting users' namespaces is pretty much irrelevant.