
Fuzzy search issues / merge requests

Hiroyuki Sato requested to merge hiroponz/gitlab-ce:fuzzy-issue-search into master

What does this MR do?

This MR resolves the following problem.

When performing a multi-word issue search, we appear to be looking for the exact string. e.g. "smart deploy" must match both words, in that order, consecutively. But I want it to find any issue with "smart" and "deploy" in any order. e.g. "deploy smart" or "smart fuzz deploy".

Example from comment:

The search term foo "really bar" baz would return results with:

  • foo fuzz really bar fuzz baz
  • foo fuzz baz fuzz really bar
  • really bar fuzz foo fuzz baz
  • really bar fuzz baz fuzz foo
  • baz fuzz foo fuzz really bar
  • baz fuzz really bar fuzz foo
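
The matching rule in the example above can be sketched in plain Ruby. This is only an illustration of the intended behaviour, not the code in this MR: the helper names `split_terms` and `matches_all_terms?` are hypothetical, and the real search runs in SQL rather than in memory.

```ruby
# Hypothetical helpers for illustration; these names are not from the MR diff.

# Split a query into terms: double-quoted phrases stay together,
# everything else is split on whitespace.
def split_terms(query)
  query.scan(/"([^"]+)"|(\S+)/).map { |quoted, bare| quoted || bare }
end

# True when every term occurs somewhere in the text, in any order.
def matches_all_terms?(text, query)
  haystack = text.downcase
  split_terms(query).all? { |term| haystack.include?(term.downcase) }
end

query = 'foo "really bar" baz'
p split_terms(query)                                          # => ["foo", "really bar", "baz"]
p matches_all_terms?("baz fuzz really bar fuzz foo", query)   # => true
p matches_all_terms?("foo fuzz bar really fuzz baz", query)   # => false
```

The last call returns false because the quoted phrase "really bar" must appear as written, while the unquoted terms may appear anywhere.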

For performance reasons, words shorter than 3 characters are ignored. This performance problem already exists on GitLab.com, but it is not noticeable.
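
A minimal sketch of that length filter, assuming the query has already been split into words. The constant and method names are hypothetical. The threshold of 3 comes from this description; it probably relates to the trigram indexes visible in the query plan, which generally cannot narrow a pattern shorter than three characters, but that reasoning is an assumption rather than something stated here.

```ruby
# Hypothetical names; only the threshold of 3 comes from this MR description.
MIN_WORD_LENGTH = 3

# Keep only the words long enough to be searched efficiently.
def searchable_words(words, min_length: MIN_WORD_LENGTH)
  words.select { |word| word.length >= min_length }
end

p searchable_words(%w[a aa aaa deploy]) # => ["aaa", "deploy"]
```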

I tested the Issues API with curl.

1 char -> X-Runtime: 47.466618

    % time curl -I --header "PRIVATE-TOKEN: xxx" "https://gitlab.com/api/v4/issues?scope=all&search=a"    satouhiroyuki@satou-no-MacBook-Air
    HTTP/1.1 200 OK
    Server: nginx
    Date: Wed, 30 Aug 2017 13:43:23 GMT
    Content-Type: application/json
    Content-Length: 58696
    Cache-Control: no-cache
    Link: ; rel="next", ; rel="first", ; rel="last"
    Vary: Origin
    X-Frame-Options: SAMEORIGIN
    X-Next-Page: 2
    X-Page: 1
    X-Per-Page: 20
    X-Prev-Page:
    X-Request-Id: 23f26aa2-a19f-4a45-b915-288349b90601
    X-Runtime: 47.466618
    X-Total: 326873
    X-Total-Pages: 16344
    Strict-Transport-Security: max-age=31536000

    curl -I --header "PRIVATE-TOKEN: xxx"  0.03s user 0.02s system 0% cpu 51.296 total

2 chars -> 502 Bad Gateway

    $ time curl -I --header "PRIVATE-TOKEN: xxx" "https://gitlab.com/api/v4/issues?scope=all&search=aa"    satouhiroyuki@satou-no-MacBook-Air
    HTTP/1.1 502 Bad Gateway
    Server: nginx
    Date: Wed, 30 Aug 2017 13:42:09 GMT
    Content-Type: text/plain
    Content-Length: 24

    curl -I --header "PRIVATE-TOKEN: xxx"  0.03s user 0.02s system 0% cpu 1:03.42 total

3 chars -> X-Runtime: 8.509205

    % time curl -I --header "PRIVATE-TOKEN: xxx" "https://gitlab.com/api/v4/issues?scope=all&search=aaa"    satouhiroyuki@satou-no-MacBook-Air
    HTTP/1.1 200 OK
    Server: nginx
    Date: Wed, 30 Aug 2017 13:43:37 GMT
    Content-Type: application/json
    Content-Length: 60599
    Cache-Control: no-cache
    Link: ; rel="next", ; rel="first", ; rel="last"
    Vary: Origin
    X-Frame-Options: SAMEORIGIN
    X-Next-Page: 2
    X-Page: 1
    X-Per-Page: 20
    X-Prev-Page:
    X-Request-Id: 759f61b1-2305-4b3f-a15f-7921c18ef407
    X-Runtime: 8.509205
    X-Total: 1462
    X-Total-Pages: 74
    Strict-Transport-Security: max-age=31536000

    curl -I --header "PRIVATE-TOKEN: xxx"  0.03s user 0.01s system 0% cpu 9.771 total

Are there points in the code the reviewer needs to double check?

  • SQL performance

The following is the EXPLAIN ANALYSE output for the query generated by Issue.full_search("foo bar").

    gitlabhq_development=# SELECT COUNT(*) FROM issues;
     count
    -------
     12333
    (1 row)

    gitlabhq_development=#
    gitlabhq_development=# EXPLAIN ANALYSE SELECT "issues".* FROM "issues" WHERE "issues"."deleted_at" IS NULL AND ("issues"."title" ILIKE '%foo%' AND "issues"."title" ILIKE '%bar%' OR "issues"."description" ILIKE '%foo%' AND "issues"."description" ILIKE '%bar%') ORDER BY "issues"."id" DESC;
                                                    QUERY PLAN
    ------------------------------------------------------------------------------------------------------------
     Sort  (cost=52.03..52.04 rows=1 width=347) (actual time=0.053..0.053 rows=0 loops=1)
       Sort Key: id DESC
       Sort Method: quicksort  Memory: 25kB
       ->  Bitmap Heap Scan on issues  (cost=48.00..52.02 rows=1 width=347) (actual time=0.042..0.042 rows=0 loops=1)
             Recheck Cond: ((((title)::text ~~* '%foo%'::text) AND ((title)::text ~~* '%bar%'::text)) OR ((description ~~* '%foo%'::text) AND (description ~~* '%bar%'::text)))
             Filter: (deleted_at IS NULL)
             ->  BitmapOr  (cost=48.00..48.00 rows=1 width=0) (actual time=0.040..0.040 rows=0 loops=1)
                   ->  Bitmap Index Scan on index_issues_on_title_trigram  (cost=0.00..24.00 rows=1 width=0) (actual time=0.024..0.024 rows=0 loops=1)
                         Index Cond: (((title)::text ~~* '%foo%'::text) AND ((title)::text ~~* '%bar%'::text))
                   ->  Bitmap Index Scan on index_issues_on_description_trigram  (cost=0.00..24.00 rows=1 width=0) (actual time=0.016..0.016 rows=0 loops=1)
                         Index Cond: ((description ~~* '%foo%'::text) AND (description ~~* '%bar%'::text))
     Planning time: 0.694 ms
     Execution time: 0.110 ms
    (13 rows)
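
For reviewers reading the query above, the shape of its search condition can be reproduced with a short plain-Ruby sketch: one ILIKE per word, joined with AND within a column, and the columns joined with OR. The helpers `fuzzy_condition` and `escape_like` are hypothetical and exist only to show that structure. They are not the code in this MR, which would normally build the condition through ActiveRecord/Arel rather than string interpolation.

```ruby
# Hypothetical helpers, for illustration only. Production code should use
# Arel or bound parameters instead of interpolating values into SQL.

# Double single quotes and backslash-escape LIKE wildcards in a search word.
def escape_like(word)
  word.gsub("'", "''").gsub(/[\\%_]/) { |char| "\\#{char}" }
end

# Build: (col1 ILIKE w1 AND col1 ILIKE w2 OR col2 ILIKE w1 AND col2 ILIKE w2)
# Returns nil when there are no words, since "()" would be invalid SQL.
def fuzzy_condition(table, columns, words)
  return nil if words.empty?

  per_column = columns.map do |column|
    words.map { |word| %Q["#{table}"."#{column}" ILIKE '%#{escape_like(word)}%'] }
         .join(" AND ")
  end
  "(#{per_column.join(' OR ')})"
end

puts fuzzy_condition("issues", %w[title description], %w[foo bar])
```

Because AND binds more tightly than OR in SQL, the generated condition needs no inner parentheses, which matches the query shown in the EXPLAIN ANALYSE output.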

Why was this MR needed?

It is difficult to find issues with a multi-word query.

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Closes #26835 (closed), #29994 (closed), #20362 (closed)

Edited by Toon Claes
