Improve autocomplete search for projects with scoring
There are several pages where we show project selector (dropdown):
- Value Stream Analytics
- Productivity Analytics
- When moving an issue
Mainly, these two endpoints are used to populate the project list:
-
/autocomplete/projects
(global, not part of the public API, for the UI only) /api/v4/groups/:group_id/projects (group level)
The underlying database queries are quite similar:
- Filter projects by joining the
authorized_projects
table. - Filter project attributes with a wildcard query.
- Order the result. (by star count or name or last activity)
Example wildcard query:
"projects"."path" ILIKE '%gitl%'
OR "projects"."name" ILIKE '%gitl%'
OR "projects"."description" ILIKE '%gitl%'
The wildcard query is rather complex and cannot be supported by a simple index, the database will collect all project records (with authorization record) and filter them by the wildcard query. This means adding extra calculations and ordering might not affect the performance that much.
Notice the Filter
when scanning the index_projects_on_namespace_id_and_id
index: Query plan
Sorting by name
When a user searches for a particular project (sorting by name or other column), often results in unexpected search results.
Example: Search for gitlab
in the gitlab-org
projects. API call
The first result is a project called "gitlab
string in the description and the padlock makes it the first element when sorting alphabetically.
Scoring
We can get better results if we try to score the results based on how closely the search text matches the record.
Scoring rules (needs tweaking):
- Exact match in project name: 300
- Exact match in project path: 300
- Exact case insensitive match in project name: 150
- Partial case insensitive match (starts with) in project name: 100
- Partial case insensitive match (starts with) in project path: 100
- Project case insensitive path contains the search string: 50
- Project case insensitive description contains the search string: 10
- Note: For the partial matches, take into account the match size.
Summing the values would give us the score. If we order the results by score, we'll get more relevant projects on top.
Example query when the user searches for gitla
:
select name, path, ((CASE WHEN "projects"."name"='gitla' THEN 300 ELSE 0 END) +
(CASE WHEN "projects"."path"='gitla' THEN 300 ELSE 0 END) +
(CASE WHEN LOWER("projects"."name")=LOWER('gitla') THEN 150 ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."name"), LOWER('gitla')) = 1 THEN 100 - ABS(LENGTH('gitla') - LENGTH("projects"."name")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."path"), LOWER('gitla')) = 1 THEN 100 - ABS(LENGTH('gitla') - LENGTH("projects"."path")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."name"), LOWER('gitla')) != 1 AND strpos(LOWER("projects"."name"), LOWER('gitla')) != 0 THEN 50 - ABS(LENGTH('gitla') - LENGTH("projects"."name")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."description"), LOWER('gitla')) = 1 THEN 10 ELSE 0 END)) as score
FROM projects WHERE namespace_id = 9970 ORDER BY score DESC;
Top 5 records ordered by score:
name | path | score
----------------------------------------+----------------------------------------+-------
GitLab | gitlab | 208
GitLab-Git | gitlab-git | 200
GitLab FOSS | gitlab-foss | 198
GitLab Docs | gitlab-docs | 198
gitlab_emoji | gitlab_emoji | 196
Results for gitlab
:
name | path | score
----------------------------------------+----------------------------------------+-------
GitLab | gitlab | 660
GitLab-Git | gitlab-git | 202
GitLab FOSS | gitlab-foss | 200
GitLab Docs | gitlab-docs | 200
gitlab_emoji | gitlab_emoji | 198
Results for git
:
name | path | score
----------------------------------------+----------------------------------------+-------
Git | git | 660
gitaly | gitaly | 204
GitLab | gitlab | 204
GitLab-Git | gitlab-git | 196
GitLab Docs | gitlab-docs | 194
Results for pages
:
name | path | score
----------------------------------------+----------------------------------------+-------
gitlab-pages | gitlab-pages | 43
gitlab-pages-proto | gitlab-pages-proto | 37
gitlab-shell | gitlab-shell | 0
GitLab CI | gitlab-ci | 0
GitLab CI Runner | gitlab-ci-runner | 0
Performance
Query to filter projects within a group:
SELECT "projects".*
FROM "projects"
WHERE "projects"."namespace_id" IN
(WITH RECURSIVE "base_and_descendants" AS (
(SELECT "namespaces".*
FROM "namespaces"
WHERE "namespaces"."type" = 'Group'
AND "namespaces"."id" = 9970)
UNION
(SELECT "namespaces".*
FROM "namespaces",
"base_and_descendants"
WHERE "namespaces"."type" = 'Group'
AND "namespaces"."parent_id" = "base_and_descendants"."id")) SELECT "id"
FROM "base_and_descendants" AS "namespaces")
AND (EXISTS
(SELECT 1
FROM "project_authorizations"
WHERE "project_authorizations"."user_id" = 4156052
AND (project_authorizations.project_id = projects.id))
OR projects.visibility_level IN (0,
10,
20))
AND (("projects"."path" ILIKE '%gitl%'
OR "projects"."name" ILIKE '%gitl%')
OR "projects"."description" ILIKE '%gitl%')
AND "projects"."archived" = FALSE
AND "projects"."pending_delete" = FALSE
ORDER BY "projects"."name" ASC,
"projects"."id" ASC
LIMIT 50
OFFSET 0
8ms, Plan
Query to filter projects within a group with scoring:
SELECT "projects".*, ((CASE WHEN "projects"."name"='gitl' THEN 300 ELSE 0 END) +
(CASE WHEN "projects"."path"='gitl' THEN 300 ELSE 0 END) +
(CASE WHEN LOWER("projects"."name")=LOWER('gitl') THEN 150 ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."name"), LOWER('gitl')) = 1 THEN 100 - ABS(LENGTH('gitl') - LENGTH("projects"."name")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."path"), LOWER('gitl')) = 1 THEN 100 - ABS(LENGTH('gitl') - LENGTH("projects"."path")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."name"), LOWER('gitl')) != 1 AND strpos(LOWER("projects"."name"), LOWER('gitl')) != 0 THEN 50 - ABS(LENGTH('gitl') - LENGTH("projects"."name")) ELSE 0 END) +
(CASE WHEN strpos(LOWER("projects"."description"), LOWER('gitl')) = 1 THEN 10 ELSE 0 END)) as score
FROM "projects"
WHERE "projects"."namespace_id" IN
(WITH RECURSIVE "base_and_descendants" AS (
(SELECT "namespaces".*
FROM "namespaces"
WHERE "namespaces"."type" = 'Group'
AND "namespaces"."id" = 9970)
UNION
(SELECT "namespaces".*
FROM "namespaces",
"base_and_descendants"
WHERE "namespaces"."type" = 'Group'
AND "namespaces"."parent_id" = "base_and_descendants"."id")) SELECT "id"
FROM "base_and_descendants" AS "namespaces")
AND (EXISTS
(SELECT 1
FROM "project_authorizations"
WHERE "project_authorizations"."user_id" = 4156052
AND (project_authorizations.project_id = projects.id))
OR projects.visibility_level IN (0,
10,
20))
AND (("projects"."path" ILIKE '%gitl%'
OR "projects"."name" ILIKE '%gitl%')
OR "projects"."description" ILIKE '%gitl%')
AND "projects"."archived" = FALSE
AND "projects"."pending_delete" = FALSE
ORDER BY score DESC
LIMIT 50
OFFSET 0
9ms, Plan
The sorting happens in the memory for both cases. Scoring is adds a slight overhead.
Implementation Plan
Note: Using elasticsearch would be better tool for the job, however at the moment elasticsearch is not 100% available within GitLab.
- Introduce a new sorting option to the project search API:
relevance
(defaults toDESC
order). - When the
relevance
sort option is given, add the scoring calculation to the query. - If no search text is given, fall back to order by name (
ASC
). - Mark the
relevance
sort option as experimental or interrnal for now in the docs.