
[EE] Support unlimited file search in web UI and API

Jan Provaznik requested to merge blob-count2-ee into master

What does this MR do?

This MR removes the limit of 100 results applied to blob and wiki blob searches.

Because filename and content searches are performed through a Gitaly request that returns all matches anyway, applying a limit of 100 saves little time (most of the time is spent on the content search on the Gitaly side) and has significant disadvantages for search usage:

  • at most 100 matches can be returned, both in the web UI and in the API - this is quite limiting, especially for the API
  • if additional filters are used (e.g. path:...), they are applied only to the first 100 results, which may yield an incomplete (or even empty) set of matches
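
The second point can be illustrated with a minimal sketch. The data, the `LIMIT` constant and the `path_filter` lambda are hypothetical stand-ins, not code from this MR; the sketch only shows why filtering after truncation can lose every relevant match.

```ruby
# Hypothetical data: 150 matches, of which only the last 50 live under "doc/".
matches = (1..100).map { |i| "app/file_#{i}.rb" } +
          (1..50).map { |i| "doc/page_#{i}.md" }

LIMIT = 100

# Hypothetical stand-in for a path: filter.
path_filter = ->(name) { name.start_with?("doc/") }

# Behaviour with the limit: truncate first, then filter the truncated set.
limited_then_filtered = matches.first(LIMIT).select(&path_filter)

# Behaviour without the limit: filter every match.
filtered_all = matches.select(&path_filter)

puts limited_then_filtered.size # 0
puts filtered_all.size          # 50
```

With the limit applied first, all 50 matching files are cut off before the filter ever sees them.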

Changes in this MR:

  • removes the limit of 100 and paginates all matches
  • no longer sorts filename and content matches together - filename matches are now listed first, followed by content matches (sorting is already done on the Gitaly side)
  • applies filters to all results (not only a subset of them); filters now operate on the "binary" UTF string, because running utf_encode on all results is too expensive
  • moves the FoundBlob class into a separate file and extends it; specifically, fetching and parsing are done lazily, only when an attribute is actually requested, which allows FoundBlob to be used for the non-paginated array of matches
  • returns only the blob instead of an array of tuples [blob.filename, blob] - there is no reason to pass the tuple
  • this change applies only to non-Elasticsearch search - Elasticsearch doesn't use this code
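
The lazy-parsing idea can be sketched as follows. This is not the actual FoundBlob implementation: the class name `LazyFoundBlob`, the NUL-separated raw format, the `parsed?` helper and the page size are all assumptions made for illustration. The point is that wrapping every match stays cheap, and only the blobs on the requested page are parsed.

```ruby
# Hypothetical sketch; NOT the FoundBlob class from this MR.
class LazyFoundBlob
  attr_reader :raw

  def initialize(raw)
    @raw = raw      # keep the unparsed match; no parsing happens here
    @parsed = nil
  end

  def filename
    parsed[:filename]
  end

  def startline
    parsed[:startline]
  end

  def data
    parsed[:data]
  end

  # Helper for the demonstration only: has this blob been parsed yet?
  def parsed?
    !@parsed.nil?
  end

  private

  # Assumed raw format: filename, start line and content separated by NUL.
  def parsed
    @parsed ||= begin
      filename, startline, data = raw.split("\0", 3)
      { filename: filename, startline: startline.to_i, data: data }
    end
  end
end

# 1000 hypothetical raw matches.
raw_matches = (1..1000).map do |i|
  ["lib/file_#{i}.rb", i, "def test_#{i}; end"].join("\0")
end

# Wrap every match (cheap), then take one page of results.
blobs = raw_matches.map { |raw| LazyFoundBlob.new(raw) }
page = 2
per_page = 20
page_items = blobs[(page - 1) * per_page, per_page]

puts page_items.first.filename # lib/file_21.rb
puts blobs.count(&:parsed?)    # 1
```

Only the one blob whose attribute was read has been parsed; the other 999 instances still hold just their raw string.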

Performance impact

This change adds a relatively small penalty to the search time. The main cost is that a new FoundBlob instance is now initialized for each match, and filters (if used in the search string, which I think is not very common) are applied to all matches. This overhead is marginal for thousands of matches. For big sets of matches, the overhead is still acceptable relative to the time spent by grep.

Below are statistics gathered on the linux repository for searches returning tens of thousands and hundreds of thousands of matches.

Time spent in FileFinder.find - most of the performance-related changes were made in this method:

search string                                    w/o MR    with MR
test (45 000 matches from grep)                  3.57      3.69
test Documentation (45 000 matches from grep)    3.54      3.86
ab (414 000 matches from grep)                   6.0       6.6
ab Documentation (414 000 matches from grep)     6.2       8.66

Overall request time:

search string                                    w/o MR     with MR
test (45 000 matches from grep)                  4137ms     4635ms
test Documentation (45 000 matches from grep)    8765ms     9417ms
ab (414 000 matches from grep)                   6324ms     7589ms
ab Documentation (414 000 matches from grep)     11350ms    14027ms

The large penalty of about 5s for requests that use ...path in the search string is unrelated to this MR - in that case the commit count takes much longer both with and without the MR.
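
For reference, the relative overhead implied by the overall request times above can be computed with a few lines of Ruby. The numbers are copied from the table; the variable names are only illustrative.

```ruby
# Overall request times in ms, copied from the table: [w/o MR, with MR].
timings = {
  "test"               => [4137, 4635],
  "test Documentation" => [8765, 9417],
  "ab"                 => [6324, 7589],
  "ab Documentation"   => [11350, 14027]
}

# Relative overhead in percent, rounded to one decimal place.
overhead = timings.transform_values do |(before, after)|
  ((after - before) * 100.0 / before).round(1)
end

overhead.each { |search, pct| puts "#{search}: +#{pct}%" }
```

This gives roughly +12% and +7% for the 45 000-match searches, and roughly +20% and +24% for the 414 000-match searches.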

What are the relevant issue numbers?

Closes https://gitlab.com/gitlab-org/gitlab-ce/issues/45915

Does this MR meet the acceptance criteria?

Edited by Rémy Coutable
