Improve autocomplete search for projects with scoring

There are several pages where we show project selector (dropdown):

Value Stream Analytics
Productivity Analytics
When moving an issue

Mainly, these two endpoints are used to populate the project list:

/autocomplete/projects (global, not part of the public API, for the UI only)
/api/v4/groups/:group_id/projects (group level)

The underlying database queries are quite similar:

Filter projects by joining the authorized_projects table.
Filter project attributes with a wildcard query.
Order the result. (by star count or name or last activity)

Example wildcard query:

"projects"."path" ILIKE '%gitl%'
OR "projects"."name" ILIKE '%gitl%'
OR "projects"."description" ILIKE '%gitl%'

The wildcard query is rather complex and cannot be supported by a simple index, the database will collect all project records (with authorization record) and filter them by the wildcard query. This means adding extra calculations and ordering might not affect the performance that much.

Notice the Filter when scanning the index_projects_on_namespace_id_and_id index: Query plan

Sorting by name

When a user searches for a particular project (sorting by name or other column), often results in unexpected search results.

Example: Search for gitlab in the gitlab-org projects. API call

The first result is a project called "🔒 gitaly". The project contains the gitlab string in the description and the padlock makes it the first element when sorting alphabetically.

Scoring

We can get better results if we try to score the results based on how closely the search text matches the record.

Scoring rules (needs tweaking):

Exact match in project name: 300
Exact match in project path: 300
Exact case insensitive match in project name: 150
Partial case insensitive match (starts with) in project name: 100
Partial case insensitive match (starts with) in project path: 100
Project case insensitive path contains the search string: 50
Project case insensitive description contains the search string: 10
Note: For the partial matches, take into account the match size.

Summing the values would give us the score. If we order the results by score, we'll get more relevant projects on top.

Example query when the user searches for gitla:

select name, path, ((CASE WHEN "projects"."name"='gitla' THEN 300 ELSE 0 END) +
(CASE WHEN "projects"."path"='gitla' THEN 300 ELSE 0 END) +
(CASE WHEN LOWER("projects"."name")=LOWER('gitla') THEN 150 ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."name"), LOWER('gitla')) = 1 THEN 100 - ABS(LENGTH('gitla') - LENGTH("projects"."name")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."path"), LOWER('gitla')) = 1 THEN 100 - ABS(LENGTH('gitla') - LENGTH("projects"."path")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."name"), LOWER('gitla')) != 1 AND strpos(LOWER("projects"."name"), LOWER('gitla')) != 0 THEN 50 - ABS(LENGTH('gitla') - LENGTH("projects"."name")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."description"), LOWER('gitla')) = 1 THEN 10 ELSE 0 END)) as score 
FROM projects WHERE namespace_id = 9970 ORDER BY score DESC;

Top 5 records ordered by score:

                  name                  |                  path                  | score
----------------------------------------+----------------------------------------+-------
 GitLab                                 | gitlab                                 |   208
 GitLab-Git                             | gitlab-git                             |   200
 GitLab FOSS                            | gitlab-foss                            |   198
 GitLab Docs                            | gitlab-docs                            |   198
 gitlab_emoji                           | gitlab_emoji                           |   196

Results for gitlab:

                  name                  |                  path                  | score
----------------------------------------+----------------------------------------+-------
 GitLab                                 | gitlab                                 |   660
 GitLab-Git                             | gitlab-git                             |   202
 GitLab FOSS                            | gitlab-foss                            |   200
 GitLab Docs                            | gitlab-docs                            |   200
 gitlab_emoji                           | gitlab_emoji                           |   198

Results for git:

                  name                  |                  path                  | score
----------------------------------------+----------------------------------------+-------
 Git                                    | git                                    |   660
 gitaly                                 | gitaly                                 |   204
 GitLab                                 | gitlab                                 |   204
 GitLab-Git                             | gitlab-git                             |   196
 GitLab Docs                            | gitlab-docs                            |   194

Results for pages:

                  name                  |                  path                  | score
----------------------------------------+----------------------------------------+-------
 gitlab-pages                           | gitlab-pages                           |    43
 gitlab-pages-proto                     | gitlab-pages-proto                     |    37
 gitlab-shell                           | gitlab-shell                           |     0
 GitLab CI                              | gitlab-ci                              |     0
 GitLab CI Runner                       | gitlab-ci-runner                       |     0

Performance

Query to filter projects within a group:

SELECT "projects".*
FROM "projects"
WHERE "projects"."namespace_id" IN
    (WITH RECURSIVE "base_and_descendants" AS (
                                                 (SELECT "namespaces".*
                                                  FROM "namespaces"
                                                  WHERE "namespaces"."type" = 'Group'
                                                    AND "namespaces"."id" = 9970)
                                               UNION
                                                 (SELECT "namespaces".*
                                                  FROM "namespaces",
                                                       "base_and_descendants"
                                                  WHERE "namespaces"."type" = 'Group'
                                                    AND "namespaces"."parent_id" = "base_and_descendants"."id")) SELECT "id"
     FROM "base_and_descendants" AS "namespaces")
  AND (EXISTS
         (SELECT 1
          FROM "project_authorizations"
          WHERE "project_authorizations"."user_id" = 4156052
            AND (project_authorizations.project_id = projects.id))
       OR projects.visibility_level IN (0,
                                        10,
                                        20))
  AND (("projects"."path" ILIKE '%gitl%'
        OR "projects"."name" ILIKE '%gitl%')
       OR "projects"."description" ILIKE '%gitl%')
  AND "projects"."archived" = FALSE
  AND "projects"."pending_delete" = FALSE
ORDER BY "projects"."name" ASC,
         "projects"."id" ASC
LIMIT 50
OFFSET 0

8ms, Plan

Query to filter projects within a group with scoring:

SELECT "projects".*, ((CASE WHEN "projects"."name"='gitl' THEN 300 ELSE 0 END) +
(CASE WHEN "projects"."path"='gitl' THEN 300 ELSE 0 END) +
(CASE WHEN LOWER("projects"."name")=LOWER('gitl') THEN 150 ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."name"), LOWER('gitl')) = 1 THEN 100 - ABS(LENGTH('gitl') - LENGTH("projects"."name")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."path"), LOWER('gitl')) = 1 THEN 100 - ABS(LENGTH('gitl') - LENGTH("projects"."path")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."name"), LOWER('gitl')) != 1 AND strpos(LOWER("projects"."name"), LOWER('gitl')) != 0 THEN 50 - ABS(LENGTH('gitl') - LENGTH("projects"."name")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."description"), LOWER('gitl')) = 1 THEN 10 ELSE 0 END)) as score
FROM "projects"
WHERE "projects"."namespace_id" IN
    (WITH RECURSIVE "base_and_descendants" AS (
                                                 (SELECT "namespaces".*
                                                  FROM "namespaces"
                                                  WHERE "namespaces"."type" = 'Group'
                                                    AND "namespaces"."id" = 9970)
                                               UNION
                                                 (SELECT "namespaces".*
                                                  FROM "namespaces",
                                                       "base_and_descendants"
                                                  WHERE "namespaces"."type" = 'Group'
                                                    AND "namespaces"."parent_id" = "base_and_descendants"."id")) SELECT "id"
     FROM "base_and_descendants" AS "namespaces")
  AND (EXISTS
         (SELECT 1
          FROM "project_authorizations"
          WHERE "project_authorizations"."user_id" = 4156052
            AND (project_authorizations.project_id = projects.id))
       OR projects.visibility_level IN (0,
                                        10,
                                        20))
  AND (("projects"."path" ILIKE '%gitl%'
        OR "projects"."name" ILIKE '%gitl%')
       OR "projects"."description" ILIKE '%gitl%')
  AND "projects"."archived" = FALSE
  AND "projects"."pending_delete" = FALSE
ORDER BY score DESC
LIMIT 50
OFFSET 0

9ms, Plan

The sorting happens in the memory for both cases. Scoring is adds a slight overhead.

Implementation Plan

Note: Using elasticsearch would be better tool for the job, however at the moment elasticsearch is not 100% available within GitLab.

Introduce a new sorting option to the project search API: relevance (defaults to DESC order).
When the relevance sort option is given, add the scoring calculation to the query.
If no search text is given, fall back to order by name (ASC).
Mark the relevance sort option as experimental or interrnal for now in the docs.

Related issues

Edited Aug 03, 2020 by Dan Jensen