Improve performance of Banzai::Filter::ReferenceFilter (!34741) · Merge requests · GitLab.org / GitLab

Problem

From the flame graph we can see that it takes 38.2s to load, and Banzai::Filter::ReferenceFilter#each_node looked like the biggest offender (7.9-11.7s <-> 25-30% of time) and a good candidate to investigate.

It turned out that Banzai::Filter::ReferenceFilter#each_node is creating and iterating 22346 Nokogiri::XML::Text objects for this CHANGELOG.md file.

I discovered that there were a lot of empty lines, so with skipping empty lines, we reduced the number to 11098 for this particular file.

It turned out that we have various Banzai filters, such as:

Banzai::Filter::IterationReferenceFilter
Banzai::Filter::UserReferenceFilter
Banzai::Filter::ProjectReferenceFilter
Banzai::Filter::IssueReferenceFilter
Banzai::Filter::MergeRequestReferenceFilter
Banzai::Filter::SnippetReferenceFilter
Banzai::Filter::CommitRangeReferenceFilter
Banzai::Filter::CommitReferenceFilter

We are parsing the same doc and filtering the same nodes over and over again (most of the time).

If these filters don't find any references and they pass to the next filter unchanged doc, that we are filtering again. In case when they actually replace the text with an actual link, doc changes, and those nodes should reflect those changed nodes as well.

Example:

Let say that we have the following text:

 !122 #12323 @root .`

This text will be parsed to

[<Nokogiri::XML::Text "!122 #1 @root .">]

`ReferenceMergeRequestFilter`

will parse the doc using each_node and iterate through document searching for project_reference_pattern (!merge_request_id), and it will replace this node with 2 new nodes:

[<Nokogiri::XML::Element name="a" attributes=[<Nokogiri::XML::Attr name="href" value="../merge_requests/122">, ...>,

<Nokogiri::XML::Text " #12323 @root .">]

`ReferenceIssuesFilter`

will parse the doc using each_node again and iterate through the document searching for issue_reference_pattern (#issue_id), and it will replace the new node:

[<Nokogiri::XML::Text " #12323 @root .">]

with:

[<Nokogiri::XML::Element name="a" attributes=[<Nokogiri::XML::Attr name="href" value="../issues/12323">, ...>,

<Nokogiri::XML::Text " @nmilojevic1 .">]

And the actual document will now look like this:

[ <Nokogiri::XML::Element name="a" attributes=[<Nokogiri::XML::Attr name="href" value="../merge_requests/122">, ...>, 

<Nokogiri::XML::Element name="a" attributes=[<Nokogiri::XML::Attr name="href" value="../issues/12323">, ...>,

<Nokogiri::XML::Text " @nmilojevic1 .">]

`ReferenceUserFilter`

will parse the doc using each_node again and iterate through the document searching for user reference pattern (@user_name), and it will replace the new node:

[<Nokogiri::XML::Text " @nmilojevic1 .">]

with:

[<Nokogiri::XML::Element name="a" attributes=[<Nokogiri::XML::Attr name="href" value="../users/nmilojevic1">, ...>,

<Nokogiri::XML::Text " .">]

And the actual document will now look like this:

[ <Nokogiri::XML::Element name="a" attributes=[<Nokogiri::XML::Attr name="href" value="../merge_requests/122">, ...>, 

<Nokogiri::XML::Element name="a" attributes=[<Nokogiri::XML::Attr name="href" value="../issues/12323">, ...>,

<Nokogiri::XML::Element name="a" attributes=[<Nokogiri::XML::Attr name="href" value="../users/nmilojevic1">, 

<Nokogiri::XML::Text " .">]

When we parse the whole (big) doc each time using each_node for each reference_filter, it searches for text and unreferenced links through 22.000 lines, each time.

What does this MR tries to do?

When we replace the node, we update the list with replaced node (keeping text and omitting referenced links), and we pass that updated node list to the next filter.

Feature Flags

This optimization is under the feature flag:

:update_nodes_for_banzai_reference_filter (with project scope)

Rollout plan

We should enable Feature Flage for a few projects and test to make sure it works.
Enable FF for Gitlab project
Prepare MR to enable if for default for everyone
Update documentation
Prepare MR to remove feature flag

Conformity

Availability and Testing

Tested with Gitlabhq project, CHANGELOG.md file

https://staging.gitlab.com/qa-perf-testing/gitlabhq/-/preview/master/CHANGELOG.md

master branch

217580-improve-performance-of-blob branch

branch	method	time	total time
master	Banzai::Filter::ReferenceFilter#each_node	7.9s	34.7s
origin/217580-improve-performance-of-blob	Banzai::Filter::ReferenceFilter#each_node	0.574s	19.82s

In this particular case, this File is loading 75% faster than before.

Banzai::Filter::ReferenceFilter#each_node is 13x faster then before

I tested File preview as well:

branch	total time s	Banzai::Filter::ReferenceFilter#each_node
master	27.6s	6.00s (22%)
217580-improve-performance-of-blob	15.53s	0.370s (2.4%)

We can see similar improvements as we have with blob controller.

The issue preview is affected as well.

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
Tested in all supported browsers
Informed Infrastructure department of default or new setting change, if applicable per definition of done

Related to #217580 (closed)

Edited Jul 14, 2020 by Nikola Milojevic

Improve performance of Banzai::Filter::ReferenceFilter