Use a proper query parser for Advanced Search queries and field name prefixes
Problem
Today our searches are implemented using a combination of regexes and Elasticsearch simple query string syntax. This already creates a confusing user experience: the regex parts (e.g. the filename: filters) are not interpreted by the simple query string parser, so they do not compose reliably with boolean operators, although they can accidentally compose with the implicit AND operator.
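To make the failure mode concrete, here is a minimal sketch of regex-based filter extraction. The pattern, the field names, and the `split_query` helper are assumptions for illustration only, not our actual implementation. It shows how pulling `field:value` tokens out before the text reaches the query parser leaves the boolean operators dangling.

```python
import re

# Hypothetical filter pattern; the real regexes and field names may differ.
FILTER_RE = re.compile(r'\b(filename|path|extension):(\S+)')

def split_query(query):
    """Extract field:value filters with a regex and return (filters, remainder).

    The remainder is what would be handed to the simple query string parser.
    """
    filters = [(m.group(1), m.group(2)) for m in FILTER_RE.finditer(query)]
    remainder = " ".join(FILTER_RE.sub(" ", query).split())
    return filters, remainder

# The filter is removed first, so the OR no longer has a left-hand operand.
filters, remainder = split_query("filename:routes.rb OR elasticsearch")
print(filters)    # [('filename', 'routes.rb')]
print(remainder)  # OR elasticsearch

# Grouping fares worse: the closing parenthesis is swallowed into the value.
filters2, remainder2 = split_query("(filename:a.rb OR filename:b.rb) AND foo")
print(filters2)    # [('filename', 'a.rb'), ('filename', 'b.rb)')]
print(remainder2)  # ( OR AND foo
```

In both cases the filter ends up applied as an unconditional restriction, and the text that reaches the query parser no longer means what the user wrote.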
You can see some discussion about this in !48637 (comment 465730253) where we planned to add more regex based filters on top of our queries.
We can see examples of user confusion at #273162 (closed), and our documentation does not make clear whether, or how, these filters can be composed with our boolean operators as a user might expect.
The main reason a user would expect this filter syntax to compose in a certain way is that it is how nearly all advanced search tooling for developers behaves. Examples include:
- Lucene query syntax
- Kibana query syntax
- Google's syntax (e.g. (site:gitlab.com OR site:docs.gitlab.com) AND elasticsearch)
- Elasticsearch's string syntax
The standard is incredibly widespread, and we have half-implemented it using regexes, with no warning or error shown to users when our regexes do not interpret their syntax correctly.
Solution
I believe the only real way to support this query syntax is to implement a parser of some kind. We could possibly use Elasticsearch's query string syntax instead of the simple query string syntax, but we need to consider the downsides, including the fact that query string syntax may return errors to users whereas simple query string never errors. We also need to consider any security or performance implications of giving the user lower-level control of their queries. It could leak data the user should not have access to, either by querying a field they should not be able to see or through timing-based attacks that query fields of documents they should not have access to.
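For comparison, the two Elasticsearch options differ only slightly in the request body. The sketch below builds both; the helper names and the field names passed in are hypothetical, while `simple_query_string`, `query_string`, `query`, `fields`, and `default_operator` are the documented Elasticsearch query DSL keys.

```python
def simple_query_string_body(user_query, fields):
    """Body for the lenient parser: invalid syntax is ignored, not rejected."""
    return {
        "query": {
            "simple_query_string": {
                "query": user_query,
                "fields": fields,
                "default_operator": "AND",
            }
        }
    }

def query_string_body(user_query, fields):
    """Body for the strict parser: invalid syntax makes the request fail."""
    return {
        "query": {
            "query_string": {
                "query": user_query,
                # Only the default fields; the query text can still name others.
                "fields": fields,
                "default_operator": "AND",
            }
        }
    }

# Hypothetical field names, for illustration only.
body = query_string_body("(routes OR controller) AND elasticsearch", ["content", "file_name"])
```

Note that in `query_string` the `fields` list only sets the defaults searched when the user names no field. The query text itself can still contain `some_field:value`, which is the source of the data-exposure concern above.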
If we did not use the Elasticsearch query parsers then we would need to implement one ourselves. Doing this may give us more control over what the user can and can't do but we should investigate multiple options here and weigh them up.
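As a rough idea of what implementing our own parser could look like, here is a minimal recursive-descent sketch. It is one possible shape, not a proposed design: the grammar, the `ALLOWED_FIELDS` allowlist, the `QueryError` class, and the tuple-based tree are all assumptions made for illustration. It supports `AND`, `OR`, implicit AND, parentheses, and `field:value` terms, and it rejects fields that are not explicitly allowed.

```python
import re

ALLOWED_FIELDS = {"filename", "path", "extension"}  # hypothetical allowlist
TOKEN_RE = re.compile(r'\(|\)|[^\s()]+')

class QueryError(ValueError):
    """Raised for syntax the parser does not understand."""

def parse(query):
    """Parse a query string into a nested tuple tree, or raise QueryError."""
    tokens = TOKEN_RE.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        node = parse_and()
        while peek() == "OR":
            advance()
            node = ("or", node, parse_and())
        return node

    def parse_and():
        node = parse_term()
        # Adjacent terms without an operator are joined with an implicit AND.
        while peek() not in (None, ")", "OR"):
            if peek() == "AND":
                advance()
            node = ("and", node, parse_term())
        return node

    def parse_term():
        tok = peek()
        if tok is None or tok in (")", "AND", "OR"):
            raise QueryError(f"unexpected {tok!r}")
        advance()
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise QueryError("missing closing parenthesis")
            advance()
            return node
        field, sep, value = tok.partition(":")
        if sep and value:
            # Only allowlisted fields may be queried directly.
            if field not in ALLOWED_FIELDS:
                raise QueryError(f"unknown field {field!r}")
            return ("field", field, value)
        return ("term", tok)

    tree = parse_or()
    if peek() is not None:
        raise QueryError(f"unexpected {peek()!r}")
    return tree

tree = parse("(filename:a.rb OR filename:b.rb) AND elasticsearch")
```

Because the field filters are part of the same tree as the boolean operators, they compose the way the user wrote them, and the allowlist gives us a single place to control which fields a user can query. A real implementation would also need quoting, negation, and a translation step from the tree to an Elasticsearch bool query.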