Fix the scroll position in code search results
What does this MR do?
The previous algorithm was a heuristic approach which looked for a term inside a highlighted fragment. The term in the fragment was then used again to find the matching line in the actual content. This ocassionally gave incorrect results where whatever happened to appear to be highlighted was also found earlier in the document written in a different way which was not intended to be a match.
The Elasticsearch
highlighter
supports setting number_of_fragments
to 0
which means that the
highlighted result will actual be the entire content itself rather than
small fragments of content. This makes it much easier to figure out the
line number since we can just loop through this to begin with and stop
as soon as we find the opening highlight tag.
This does come at the cost that all docs are returned in the highlight section now which makes the Elasticsearch response payload approximately twice as large but it is the only correct way to do it that I could find.
An alternative described in the
docs
is to use boundary_scanner
but this requires the Fast vector
highlighter
which in term requires us to set offsets
for our
index_options
but this will use considerably more storage so we would like to avoid
this if possible.
It's worth noting that this won't fix the fact that highlighting is not behaving properly in the below examples. For that I've created #254941 (closed) which is explaining a very similar problem to this.
Screenshots
Below shows examples with the specific files linked that produced the problem.
buffer.c
Before
After
features.yml
Before
After
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry - [-] Documentation (if required)
-
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team