Change Elasticsearch `code` filter's path-separating pattern to a recursive one

In https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/5138 I added a new pattern_capture filter to separate paths (like/this/one) into their constituent tokens ([like, this, one]). I based it loosely on the path_hierarchy tokenizer, but my implementation behaves very differently from what the path_hierarchy tokenizer does.

The path_hierarchy tokenizer (with reverse: true, as we configure it) would turn like/this/one into the following tokens:

  • like/this/one
  • this/one
  • one
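
For reference, here is a minimal Python sketch of that suffix expansion. It only imitates the output listed above; it is not the Elasticsearch implementation, the function name reverse_path_hierarchy is made up for illustration, and edge cases such as leading or repeated delimiters may behave differently in the real tokenizer.

```python
def reverse_path_hierarchy(path, delimiter="/"):
    """Return every suffix of `path`, longest first.

    Hypothetical helper that mimics the token list described above
    for path_hierarchy with reverse: true; it is not the real tokenizer.
    """
    parts = path.split(delimiter)
    # Drop one leading segment at a time and re-join what is left.
    return [delimiter.join(parts[i:]) for i in range(len(parts))]


if __name__ == "__main__":
    for token in reverse_path_hierarchy("like/this/one"):
        print(token)
```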

I'm opening this issue to discuss whether this behavior is better than the one that was merged, which simply splits the path into its separate terms.

The regex for this would be (?=\/+([\w\/\.]+)). It turns like/this/one into:

  • this/one
  • one
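
A quick way to sanity-check the pattern outside Elasticsearch is to run it through Python's re module. This is only an approximation: pattern_capture uses Java regular expressions, and Python's \w is Unicode-aware by default, so results can differ for non-ASCII paths. The function name capture_suffixes is hypothetical.

```python
import re

# The pattern from this issue, copied verbatim. The lookahead is zero-width,
# so matches can overlap and every "/"-prefixed suffix gets captured.
PATH_SUFFIX_PATTERN = re.compile(r"(?=\/+([\w\/\.]+))")


def capture_suffixes(token):
    """Return what capture group 1 matches at each position in `token`.

    Hypothetical helper; it approximates the captures a pattern_capture
    filter would emit, without the original token.
    """
    return PATH_SUFFIX_PATTERN.findall(token)


if __name__ == "__main__":
    for captured in capture_suffixes("like/this/one"):
        print(captured)
```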

It doesn't emit like/this/one because the path has no leading /, but that doesn't matter: the original token is still part of the list, so emitting it again would be redundant.

cc/ @nick.thomas and @vsizov what do you think?

Edited May 19, 2022 by Coung Ngo