Improve code search matching for Advanced Search
Problem to solve
The current code analyzer for the Elasticsearch integration doesn't take all code cases into account. For example, if I have a file with a.b.c=one_two_three and I search for one_two_three, I don't get the file returned in the search.
We have improved the code analyzer in Elasticsearch, and it is now more robust at matching searches with special characters.
Examples
Current Languages Impacted:
Language | Impacted Count |
---|---|
PHP | 4 |
Java | 3 |
Python | 2 |
XML | 2 |
C (makefile) | 1 |
Puppet | 1 |
Terraform | 1 |
Rust | 1 |
Clojure | 1 |
Lisp | 1 |
Markdown | 1 |
Ruby | 1 |
JSON | 1 |
C# DotNet | 1 |
Customer ticket initially reporting this issue: https://gitlab.zendesk.com/agent/tickets/116884
Important
UPDATE Code analyzer is defined in this file: https://gitlab.com/gitlab-org/gitlab/-/blob/5e78106230570f0eea4396de76eee357ba40cfac/ee/lib/elastic/latest/config.rb#L54
The code is tokenized on whitespace, and then those tokens are put through various filters, including the custom code filter:
"code": {
"type": "pattern_capture",
"preserve_original": "true",
"patterns": [
"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
"(\\d+)",
"(?=([\\p{Lu}]+[\\p{L}]+))",
"\"((?:\\\"|[^\"]|\\\")*)\"",
"'((?:\\'|[^']|\\')*)'",
"\\.([^.]+)(?=\\.|\\s|\\Z)",
"\\/?([^\\/]+)(?=\\/|\\b)"
]
},
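One way to inspect what this filter emits for a given string is to send it to Elasticsearch's `_analyze` API with an inline tokenizer and filter definition. The sketch below only builds the request body; it does not contact a cluster. The helper name `build_analyze_request` is hypothetical, and the body covers only the whitespace tokenizer and the code filter, not the other filters the real analyzer applies.

```python
import json

# The "code" filter from the config above, written as a Python dict.
CODE_FILTER = {
    "type": "pattern_capture",
    "preserve_original": "true",
    "patterns": [
        "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
        "(\\d+)",
        "(?=([\\p{Lu}]+[\\p{L}]+))",
        "\"((?:\\\"|[^\"]|\\\")*)\"",
        "'((?:\\'|[^']|\\')*)'",
        "\\.([^.]+)(?=\\.|\\s|\\Z)",
        "\\/?([^\\/]+)(?=\\/|\\b)",
    ],
}


def build_analyze_request(text, extra_patterns=()):
    """Build an _analyze request body (hypothetical helper).

    extra_patterns lets you try out additional capture patterns without
    modifying CODE_FILTER itself.
    """
    code_filter = dict(
        CODE_FILTER,
        patterns=CODE_FILTER["patterns"] + list(extra_patterns),
    )
    return {"tokenizer": "whitespace", "filter": [code_filter], "text": text}


body = build_analyze_request("a.b.c=one_two_three")
print(json.dumps(body, indent=2))
```

Posting the printed JSON to the `_analyze` endpoint of a test cluster should list the tokens produced for the sample string, which makes it possible to compare pattern changes without reindexing.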
The current custom patterns account for quotes, periods, and path terms. They don't account for other special characters, like equal signs in this case. I added the following pattern to the code filter to account for equal signs, and I was able to search for this file successfully after reconfiguring and reindexing:
'\=?([^=]+)(?=\=|\b)' # separate terms on equal signs
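To illustrate why the extra pattern helps, here is a rough Python approximation of how a pattern_capture filter treats the token a.b.c=one_two_three. It is a sketch under several assumptions: only the period and path patterns are reproduced (Python's `re` module does not support the `\p{...}` classes), Java and Python regex semantics are treated as equivalent for these patterns, and the other filters in the analyzer are ignored. The function name `capture_tokens` is made up for this example.

```python
import re

# Subset of the existing patterns, unescaped from the JSON config above.
EXISTING_PATTERNS = [
    r"\.([^.]+)(?=\.|\s|\Z)",  # period-separated terms
    r"\/?([^\/]+)(?=\/|\b)",   # path terms
]
# The proposed addition: separate terms on equal signs.
EQUALS_PATTERN = r"\=?([^=]+)(?=\=|\b)"


def capture_tokens(token, patterns):
    """Approximate pattern_capture: keep the original token and add every
    non-empty capture group of every match of every pattern."""
    emitted = {token}  # mirrors "preserve_original": "true"
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            emitted.update(group for group in match.groups() if group)
    return emitted


token = "a.b.c=one_two_three"
before = capture_tokens(token, EXISTING_PATTERNS)
after = capture_tokens(token, EXISTING_PATTERNS + [EQUALS_PATTERN])

print("one_two_three" in before)  # False
print("one_two_three" in after)   # True
```

With only the existing patterns, the closest emitted term is c=one_two_three, so one_two_three never appears as a term of its own. Adding the equal-sign pattern emits both a.b.c and one_two_three.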
Steps to reproduce
On an instance with the Elasticsearch integration, create a project with 4 files that contain the following:
- fileA.md:
one_two_three=a.b.c
- fileB.md:
a.b.c=one_two_three
- fileC.md:
one_two_three = a.b.c
- fileD.md:
a.b.c = one_two_three
Search for one_two_three, and you'll get files A, C, and D returned but not file B.
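The four files can be generated with a short script before committing them to a test project. This is a convenience sketch only: the file names and contents come from the steps above, while the function name `write_repro_files` and the use of a temporary directory in the demo are assumptions.

```python
import tempfile
from pathlib import Path

# File names and contents taken from the reproduction steps above.
REPRO_FILES = {
    "fileA.md": "one_two_three=a.b.c",
    "fileB.md": "a.b.c=one_two_three",
    "fileC.md": "one_two_three = a.b.c",
    "fileD.md": "a.b.c = one_two_three",
}


def write_repro_files(target_dir):
    """Write the four files into target_dir and return their paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, content in REPRO_FILES.items():
        path = target / name
        path.write_text(content + "\n", encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Demo in a throwaway directory; point this at a repository checkout
    # to create the files for a real test project.
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_repro_files(tmp):
            print(path.name, "->", path.read_text(encoding="utf-8").strip())
```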
Example Project
This cannot be reproduced on GitLab.com because GitLab.com doesn't use Elasticsearch at this time.
What is the current bug behavior?
The search with the Elasticsearch integration doesn't return the correct results because the custom code filter doesn't take into account various code cases, including special characters with no whitespace.
Special characters not supported (customer found):
= # equal signs
, # commas
( # open parentheses
) # close parentheses
:: # double colon
: # colon
-> # arrow
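The equal-sign pattern could in principle be generalized to the other separators in this list. The sketch below is a hypothetical generalization, not the pattern used in GitLab: `separator_pattern` and `captured_terms` are illustrative names, and Python's `re` module stands in for the Java regex engine that Elasticsearch uses.

```python
import re

# Separators reported by the customer, as listed above.
SEPARATORS = ["=", ",", "(", ")", "::", ":", "->"]


def separator_pattern(separator):
    """Hypothetical generalization of the equal-sign pattern: optionally
    consume a leading separator, then capture everything up to the next
    separator or word boundary."""
    sep = re.escape(separator)
    return rf"(?:{sep})?((?:(?!{sep}).)+)(?={sep}|\b)"


def captured_terms(token, separator):
    """Return the capture groups a pattern_capture-style filter would emit."""
    pattern = separator_pattern(separator)
    return [match.group(1) for match in re.finditer(pattern, token)]


for separator in SEPARATORS:
    token = f"a.b.c{separator}one_two_three"
    print(repr(separator), captured_terms(token, separator))
```

For each sample token, the terms on both sides of the separator are captured on their own. Whether adding one pattern per separator is acceptable for index size and query performance would need to be checked separately.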
What is the expected correct behavior?
The Elasticsearch code analyzer should account for various code cases, including this specific case where there are equal signs with no whitespace.
Output of checks
Results of GitLab environment info
System information
System: Ubuntu 16.04
Proxy: no
Current User: git
Using RVM: no
Ruby Version: 2.5.3p105
Gem Version: 2.7.6
Bundler Version: 1.16.6
Rake Version: 12.3.2
Redis Version: 3.2.12
Git Version: 2.18.1
Sidekiq Version: 5.2.5
Go Version: unknown
GitLab information
Version: 11.9.1-ee
Revision: b50dc44
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: postgresql
DB Version: 9.6.11
URL: http://198.199.92.126
HTTP Clone URL: http://198.199.92.126/some-group/some-project.git
SSH Clone URL: git@198.199.92.126:some-group/some-project.git
Elasticsearch: yes
Geo: no
Using LDAP: yes
Using Omniauth: yes
Omniauth Providers: saml, group_saml
GitLab Shell
Version: 8.7.1
Repository storage paths:
- default: /var/opt/gitlab/git-data/repositories
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell
Git: /opt/gitlab/embedded/bin/git
Results of GitLab application Check
Checking GitLab subtasks ...
Checking GitLab Shell ...
GitLab Shell: ... GitLab Shell version >= 8.7.1 ? ... OK (8.7.1) Running /opt/gitlab/embedded/service/gitlab-shell/bin/check Check GitLab API access: OK Redis available via internal API: OK
Access to /var/opt/gitlab/.ssh/authorized_keys: OK gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
Checking Sidekiq ...
Sidekiq: ... Running? ... yes Number of Sidekiq processes ... 1
Checking Sidekiq ... Finished
Checking Incoming Email ...
Incoming Email: ... Reply by email is disabled in config/gitlab.yml
Checking Incoming Email ... Finished
Checking LDAP ...
LDAP: ... Server: ldapmain LDAP authentication... Failed. Check bind_dn and password configuration values LDAP users with access to your GitLab server (only showing the first 100 results) Server: ldapsecondary LDAP authentication... Failed. Check bind_dn and password configuration values LDAP users with access to your GitLab server (only showing the first 100 results)
Checking LDAP ... Finished
Checking GitLab App ...
Git configured correctly? ... yes Database config exists? ... yes All migrations up? ... yes Database contains orphaned GroupMembers? ... no GitLab config exists? ... yes GitLab config up to date? ... yes Log directory writable? ... yes Tmp directory writable? ... yes Uploads directory exists? ... yes Uploads directory has correct permissions? ... yes Uploads directory tmp has correct permissions? ... yes Init script exists? ... skipped (omnibus-gitlab has no init script) Init script up-to-date? ... skipped (omnibus-gitlab has no init script) Projects have namespace: ... 46/1 ... yes 46/2 ... yes 46/3 ... yes 46/4 ... yes 47/5 ... yes 47/6 ... yes 47/8 ... yes 47/9 ... yes 47/10 ... yes 47/11 ... yes 48/12 ... yes 48/13 ... yes 48/14 ... yes 48/15 ... yes 48/16 ... yes 48/17 ... yes 49/18 ... yes 49/19 ... yes 49/20 ... yes 49/21 ... yes 49/22 ... yes 49/23 ... yes 49/24 ... yes 49/25 ... yes 49/26 ... yes 49/27 ... yes 50/28 ... yes 50/29 ... yes 50/30 ... yes 50/31 ... yes 50/32 ... yes 50/33 ... yes 50/34 ... yes 51/35 ... yes 51/36 ... yes 51/37 ... yes 51/38 ... yes 51/39 ... yes 51/40 ... yes 51/41 ... yes 51/42 ... yes 51/43 ... yes 52/44 ... yes 52/45 ... yes 52/46 ... yes 52/47 ... yes 52/48 ... yes 52/49 ... yes 52/50 ... yes 52/51 ... yes 53/52 ... yes 53/53 ... yes 53/54 ... yes 53/55 ... yes 53/56 ... yes 53/57 ... yes 53/58 ... yes 53/59 ... yes 54/60 ... yes 54/61 ... yes 54/62 ... yes 55/63 ... yes 55/64 ... yes 56/65 ... yes 56/66 ... yes 56/67 ... yes 57/68 ... yes 57/69 ... yes 57/70 ... yes 57/71 ... yes 57/72 ... yes 57/73 ... yes 58/74 ... yes 58/75 ... yes 58/76 ... yes 58/77 ... yes 58/78 ... yes 59/79 ... yes 59/80 ... yes 59/81 ... yes 59/82 ... yes 59/83 ... yes 59/84 ... yes 59/85 ... yes 60/86 ... yes 60/87 ... yes 60/88 ... yes 60/89 ... yes 60/90 ... yes 1/91 ... yes 60/92 ... yes 1/93 ... yes 83/94 ... yes 83/95 ... yes 83/96 ... yes 84/97 ... yes 84/98 ... yes 84/99 ... yes 84/100 ... yes 85/101 ... yes 85/102 ... yes 85/103 ... 
yes 86/104 ... yes 86/105 ... yes 86/106 ... yes 86/107 ... yes 86/108 ... yes 87/109 ... yes 87/110 ... yes 88/111 ... yes 88/112 ... yes 88/113 ... yes 89/114 ... yes 89/115 ... yes 90/116 ... yes 90/117 ... yes 90/118 ... yes 90/119 ... yes 91/120 ... yes 91/121 ... yes 91/122 ... yes 91/123 ... yes 91/124 ... yes 98/125 ... yes 91/126 ... yes 110/127 ... yes 60/128 ... yes 60/129 ... yes 60/130 ... yes 1/131 ... yes 1/132 ... yes 1/133 ... yes 1/134 ... yes 1/135 ... yes 1/136 ... yes 1/137 ... yes 60/138 ... yes 111/139 ... yes 1/140 ... yes 1/141 ... yes 1/142 ... yes 104/143 ... yes 1/144 ... yes 1/145 ... yes 1/146 ... yes 1/147 ... yes 114/148 ... yes 1/149 ... yes 1/150 ... yes 117/151 ... yes 118/152 ... yes 119/153 ... yes 1/154 ... yes 60/155 ... yes Redis version >= 2.8.0? ... yes Ruby version >= 2.3.5 ? ... yes (2.5.3) Git version >= 2.18.0 ? ... yes (2.18.1) Git user has default SSH configuration? ... yes Active users: ... 75 Elasticsearch version 5.6 - 6.x? ... yes (6.6.1)
Checking GitLab App ... Finished
Checking GitLab subtasks ... Finished
Possible fixes
We can add additional filters here: https://gitlab.com/gitlab-org/gitlab/-/blob/5e78106230570f0eea4396de76eee357ba40cfac/ee/lib/elastic/latest/config.rb#L54
We can also look into other ES token filters: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
It is tough to take into account all code cases. I think the current implementation does a good job, with the exception of these corner cases.