Use HTML5 parser to close tags left open

What does this MR do and why?

Fixes Inline <div> corrupts the MR page layout; maybe... (#594217 - closed). A teammate noticed in #is-this-known that it was possible to ruin the layout of a page depending on the HTML used in an MR description.

It turns out that Nokogiri::HTML's parser, which forms the basis of the whole backend pipeline, is libxml2-based and obeys the HTML4 standard.

This is a Problem. GitLab explicitly uses HTML5 (view source on this page and see <!DOCTYPE html> at the top). HTML4 lets you do cursed things which led directly to this problem:

Nokogiri::HTML.fragment('<ul><li>first <div></li><li>second <div></li></ul>').to_html
=> "<ul><li>first <div><li>second <div></div>\n</li></div>\n</li></ul>"

There's now an <li> nested within a <div> within an <li>! The effect of this when then dropped into an HTML5 page is unpredictable.

Using Nokogiri::HTML5 instead, based on Gumbo, instead produces this:

irb(main):013> Nokogiri::HTML5.fragment('<ul><li>first <div></li><li>second <div></li></ul>').to_html
=> "<ul><li>first <div></div></li><li>second <div></div></li></ul>"

Each <div> is immediately closed, the <li>s retain their proper place and siblingship. All is good in the hood.

¿What's all the rest of this MR then?

  • It turns out we should really have been using an HTML5 parser for a while: it would've saved us some bug bounty money.

    I note that GitHub have long (~10 years) used a Gumbo-based HTML5 parser for their HTML pipeline: it's an extremely validated choice.

  • The rest of the MR is basically spec adjustments, both for the above issues (the basic situation that led to them can no longer be produced, so we instead assert what they do parse as, as a way of keeping them from regressing), and for other filters which now have subtly different output.

    • A common one difference that the output prefers <tag attr="&quot;"> instead of <tag attr='"'>. These mean exactly the same thing.
    • Another common difference is newline changes. These should not affect the actual display in any way.
    • One spec forgot to escape its backslashes, so I've fixed that.
    • I moved some let(:project) to let_it_be(:project) (and similar) to speed up some specs.
  • Where possible I've preferred to change failing specs that asserted the output HTML is exactly some string, which is super brittle, to instead assert properties of the resulting DOM. This is the way we've been moving with these specs for a while now, so this advances us in a consistent direction

  • I've also changed the rather confusing name ConvertTextToDocFilter, which only runs on HTML and not text (!!!), to ParseHtmlFilter.

  • The libxml2-based HTML4 parser would refuse to nest elements beyond a depth of 256, and would just flatten elements out at that point. The Gumbo-based HTML5 parser has no such limit. We need to impose a limit because our HTML sanitising library naïvely recurses on the DOM, and will happily blow the stack on deeply nested documents. We choose a really, really high limit that nonetheless doesn't cause the stack to blow, and if that limit is breached, render an error instead of the document, in the same spirit as the TimeoutFilterHandler error message. (See BaseSanitizationFilter#returned_timeout_value, and other such overrides for prior art here.)

  • Updated EqHtmlMatcher to likewise use HTML5; the whole point of this matcher is to conform to the browser's interpretation.

Screenshots or screen recordings

Before After
image image

How to set up and validate locally

Because this affects the cached HTML for Markdown renders, you must modify affected fields to cause a rerender of the Markdown. Further, because this MR includes initialiser which introduces a patch, you must restart your GDK Rails instance when switching to or from this branch.

  1. Let's first verify the current behaviour, not on this branch. Go to an MR on your GDK and edit the description with the following content, using the plain-text editor:

    * first <div>
    * second <div>
    
    Something that should be outside these list items!

    You should see the page layout break, like the "Before" screenshot above.

  2. Check out this branch. You might want to run bundle, and yarn, and bin/rails db:migrate, all as part of a natural and healthy GDK branch switching habit. Most importantly, you must gdk restart rails-web. See the paragraph before this checklist for why.

  3. Navigate back to your MR. It'll still look broken, but slightly less so. You'll see that your "Something that should be outside these list items!" is still nested improperly (this is part of the cached render), but the page layout is no longer broken (because we reparse the cached render's HTML in the post-processing pipeline, which is now using HTML5).

  4. Edit the MR description back to the above Markdown, but change a word to something else somewhere in the input. We need to ensure we don't rely on a cached render anywhere.

  5. It should look 100% un-broken, like the "After" screenshot above!

  6. Don't forget to gdk restart rails-web when switching off the branch!

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Asherah Connor

Merge request reports

Loading