Clearly delineate text and HTML in Banzai, and use helpers consistently to prevent XSS (!207397) · Merge requests · GitLab.org / GitLab

This MR was originally called "Ensure reference HTML is not unescaped in data-original", as it intended to fix a single bug. It had the following introduction:

Fixes CommonMark hyperlink labels, containing code Co... (#572761) by moving the security fix to the root cause.

Builds on Compose HTML consistently in Banzai reference f... (!208270 - merged) as there was no way to get consistent results with changes to this area without fixing the foundation.

!208270 (merged) was extracted from this MR because there was a never-ending thread of "fixing this issue causes new specs to fail because of questionable behaviour elsewhere; fixing those issues in turn cause even more", so I worked to extract what I could to prevent a mega-mega large MR.

Unfortunately, while I was then able to get a fix for just the one issue above going in this MR, there would be one or two more specs failing, which I then tried to minimally fix, but would fault one or two more ...

As it turned out, the rest of the issues are essentially interconnected, and so this became one reasonably large (+844 -485) MR that does a system-wide cleanup of Banzai's reference filters. The original explanation that accompanied the MR in its above form is included at the end of the MR description; what follows here is a description of the MR as a whole.

It's worth noting that I've already asked AppSec to give this a once-over last week. There have been changes since, but the overall direction is unchanged: keep text and HTML separate, note what variables are which, and make the transition points clear.

The filters have grown over many, many years. There has never been a comprehensive audit which made it clear which functions were are taking or returning plain text — which should not ever be treated as HTML without correct escaping — and which were taking or returning HTML, which is always assumed to be safe by the time it reaches these filters, as we are past the sanitisation stage.

The lack of clarity permeated the filters to such an extent that sometimes methods would receive text and sometimes return text, or sometimes HTML, depending on various conditions. Some functions received maybe-text-or-HTML from other functions, and then only used it depending on certain conditions, making it extremely unobvious when there would be a possible escalation of privilege (by treating text as HTML). These all set the stage for a stream of XSS issues which we've traditionally patched in a bandaid fashion. This MR instead seeks to clarify the data flow within the reference filters so we can have perfect injection resistance against XSS, as indeed, there is no reason why we shouldn't have this.

The MR looks big, but it contains many similar changes to various filters, since we can't fix one without fixing them all due to the nature of shared specs. Here I give a run-down of what to expect in the diff:

#references_in takes text as its first input, but it would sometimes then unescape that. This is a one-way street to injection. This should never be necessary, since the input is indeed text — either a text node's content, or an href attribute's value — and not HTML, or HTML-escaped text. This was only done in a few newer filters; any issues arising from this need to be corrected upstream of the filters, as this paves the way for future XSS.
- #references_in used to yield such unescaped text, which would then find its way to the data-original attribute of a replacement link, which occasionally would be substituted back in, creating XSS. We now do no unescaping of text, and we only yield the same data we matched, which is always plain text and now always treated as such.
#references_in had a variety of implementations copy-pasted around, some which were more or less safe than others. All implementations now consistently use #replace_references_in_text_with_html to safely replace matches in text with HTML, producing safe HTML which is a combination of escaped user text and HTML produced by us.
- #replace_references_in_text_with_html supports String#gsub, Gitlab::Utils::Gsub.gsub_with_limit, and Gitlab::UntrustedRegexp#replace_gsub as the underlying replacement mechanism. (Any Enumerator, actually.)
- In many cases the original implementation did a plain gsub replacement on text, substituting in HTML for valid matches, and then returned the whole part-text part-HTML string, which then got treated as HTML — essentially unescaping whatever else was in the text node and so opening XSS vectors. A apt example of this was VulnerabilityReferenceFilter (!).
In quite a few places, the variable name text was used when it in fact held HTML. This quite likely contributed to the confusion that produced vulnerabilities and has been addressed.
Likewise, link_content has been renamed in some places to link_text or link_content_html as appropriate.
Likewise, match has been renamed to match_text where appropriate, and match_data when it's a MatchData.
Extensive doc comments have been written to document the expected input and promised output of many functions and blocks. In particular, strings are described as text or HTML depending on which side of the divide they live on.
ReferenceFilter#replace_text_when_pattern_matches has been renamed to #replace_node_when_text_matches to better describe its function.
AbstractReferenceFilter#object_link_filter used to use #references_in to replace text with HTML, unless the block returned text identical to the match text, in which case it would not replace the text with HTML. If you returned slightly modified text, however, it would open an XSS. This "double-standard" of how the return value was treated has been removed; return HTML, or return nil to not do a replacement.
- This same issue has been fixed by the same change throughout the whole system: we do not ever "treat a return value as HTML, unless it returns the same text as the input text, in which case it's treated as text" (usually to signal "no replacement"). This made writing the producer functions excessively confusing, since we seemed to be able to yield HTML or text and have them both "just work" for the most part. Now, if a function yields or returns HTML, it always yields or returns HTML, and using nil to signal "no replacement" is used comprehensively, particularly where the string changes from text to HTML through the process.
- I think this accounts for a lot of the confusion of the codebase pre- this MR.
- I've also removed some levels of nesting with early returns in this particular function (which we should really encourage here, now that the default action is to safely not do any replacement). Consider viewing the diff without whitespace changes to see what's actually changed.
ReferenceFilter#replace_text_with_html actually just replaces the given node with the given HTML. It has been renamed to #replace_node_with_html; it doesn't do anything special about text.
- It does a lot of odd-looking work which will be addressed in Draft: Simplify complicated node replacement logic (!209419) (pending an upstream MR).
ReferenceFilter#replace_link_node_with_href would just call #replace_text_with_html (now #replace_node_with_html), iff the given HTML (taken from a block, but not for a clear reason beyond making specs easier to write) was not equal to the link parameter given initially. Comparing HTML to text like this a bad idea, and the separation between the name of the function, "replace link node with href", with its actual behaviour, "replace node with yielded HTML unless the yielded HTML equals the link parameter", did not aid in understanding it at all. It has been removed.
- The reason it behaved this way is due to the assumptions mentioned above, where some functions would return HTML or the original unchanged input text. If the return result was the unchanged input text, it would signal "do not replace", albeit in a shaky way that led to vulnerabilities. The same functions now return nil if no replacement is to be made, which means we change code like this:
```
if ref_pattern_start.match?(link)
  replace_link_node_with_href(node, index, link) do
    object_link_filter(link, ref_pattern_start, link_content: inner_html)
  end
end
```
  To instead read as follows:
```
if ref_pattern_start.match?(link)
  html = object_link_filter(link, ref_pattern_start, link_content_html: inner_html)
  replace_node_with_html(node, index, html) if html
end
```
  The former took me quite a while to understand. The latter hopefully is easier for anyone else to understand.
ReferenceFilter#replace_link_node_with_text is the same as above, except it would replace the node with the given HTML (again from a block) iff the yielded HTML was not equal to the node's text content. As above, this was due to the signalling behaviour, and has been replaced with conditional calls as in the above example.
ReferenceFilter#yield_valid_link used to yield the link href and the node's inner HTML; we would then compare the inner HTML to the href, and do some replacement iff they were equal. This wouldn't correctly match if there were entities involved, probably barring these from ever matching an existing link with an ampersand & in it, for example (since the href would show & but the inner HTML &). We now also yield the inner text, and use it as the basis of comparison.
- Note this was accidentally protective, as the old code used to hand inner_html to object_link_filter as its text parameter (!). Any actual entities would be treated as text, causing e.g. & to become visible on the page in certain (frankly contrived) circumstances where the percent-decoded link contained entities.
escape_html_entities and unescape_html_entities have no callers and are removed.
Similarly, unescape_link has been inlined into its single call-site.
- Hiding the implementation details in these cases hides crucial information about what conversions are actually taking place, and obscure safety implications of their use. Particularly where there's only a single call-site, having the abstraction raises more questions than it answers.
- This was separated out in d4f0c229 as VulnerabilityReferenceFilter overrode it to do no unescaping when the input matched the Vulnerability reference pattern, which matches text like [vulnerability:123], or in the most complicated example, [vulnerability:/gitlab-org/ruby/gems/gitlab-glfm-markdown/456]. There are no characters permissible in a matching vulnerability reference which would be affected by Addressable::URI.unescape (or CGI.unescape for that matter — neither % nor + is permitted), so we don't need to account for this override and can remove it.
The extra escaping of original_content in ReferenceRedactor is removed, as it is now safe to do so. See the below discussion for details on that.
Gitlab::UntrustedRegexp#replace_gsub and Gitlab::Utils::Gsub.gsub_with_limit are extended to be usable as an Enumerator.

About half of the changed lines in this MR are spec changes.

eq_html and include_html are introduced and used to make comparisons more semantic (i.e. assert "do these two strings have the same meaning to a web browser/HTML parser?", rather than "does this string equal that one when it's escaped? or unescaped?"), and less liable to accidentally assert something other than intended.
- These work by parsing both strings as HTML with Nokogiri, sorting the attributes in every element, and then serialising back to HTML. This ensures that e.g. x > y and x > y compare as equal, as do <a id=1 href="x"> and <a href='x' id="1"> (and they are!), while <b> and <b> do not.
- The matcher has its own specs to assert its behaviour.
All uses of escape_once or html_escape_once I've come across are removed; it is never good to use.
The ReferenceFilter#references_in shared examples assert that every implementer implements it safely, with no XSS introduction through confusion of text and HTML.
Similarly, a reference which does not unescape its content in data-original ensures any original HTML content cached in the data-original attribute (for later use by ReferenceRedactor) preserves the content exactly as entered.
- The # This is probably bad. comment refers to the fact that certain classes unescape content on input. We assert this so we know why the spec groups break (and why their behaviour is what it is right now) when future MRs which address these issues are written or rebased (there's an open MR for this at Stop unescaping HTML in BaseLabel#title=, #desc... (!207594) which I'll be progressing after this one).
The service desk mailer spec asserted that the @all text would get removed from the email when disable_all_mention was disabled. The text now remains in the email either way.
- This was almost certainly an accidental outcome of several unfortunate circumstances combining in UserReferenceFilter; we'd call link_to_all(link_content: link_content) when the @all reference matched and the disable_all_mention feature was disabled.
- link_to_all in turn would return plain link_content if the author was found not to be a team member.
- link_content would be nil in these cases, because @all wasn't originally a link to begin with! So it'd just be quietly removed from the text in such cases. The correct behaviour is to just leave it alone lol, which returning nil here now does.
The timeline events spec now doesn't make an assertion on the exact HTML string, since we may wish to change the order of attributes without breaking tests (as has happened in this case). We now use eq_html for the body part of the response, which asserts the HTML has the same meaning, and then assert the rest the same.

Below follows a detailed discussion of the data-original problem which led to all this untying of knots. This zooms in very closely on a vulnerability that disclosed to and fixed by us some years back, a bug report that eventually was tied to that fix, and how we move the security fix to the underlying root cause to ultimately fix the bug. That root cause wasn't addressable without fixing the whole reference filter system, (un)fortunately!

Discussion of the data-original problem and the security patch

https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/2275 escaped all data-original contents before using it in ReferenceRedactor, but this leads to legitimate original HTML content being escaped again; see #572761.

The actual root cause is LabelReferenceFilter#references_in and other similar overrides called by AbstractReferenceFilter#object_link_filter; these overrides match on de-escaped text, and then re-escape the text when returning to their callers --- but they don't re-escape when yielding. The yielded match eventually becomes data-original, causing HTML content such as <script> to be yielded as <script>. This then becomes the attribute value data-original="<script>", indistinguishable from legitimate HTML content such as <code> becoming data-original="<code>".

This MR instead has us not unescape and re-escape the entire text subject to #references_in — instead, we only unescape matched text when extracting names for lookup in the database. This avoids the issue entirely.

After this change, <script> correctly roundtrips (through data-original="&lt;script&gt;") and remains harmless, but <code> (and other safe HTML produced/sanitised by us already) now also correctly roundtrips (through data-original="<code>").

The same change is applied to all filters which use this technique to supply data back to AbstractReferenceFilter#object_link_filter, with a shared spec applied to each to prevent regressions in safety here.

The spec modified in https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/2275 is notable in that, at that time, it was using direct production of HTML through string formatting. This should be avoided at all costs. Note this part:

doc = Nokogiri::HTML.fragment("<a class='gfm' href='https://www.gitlab.com' data-reference-type='issue' data-original='#{original_content}'>bar</a>")

The spec (before the MR) used this definition of original_content:

let(:original_content) { '<code>foo</code>' }

The MR changed it to this:

let(:original_content) { '&lt;script&gt;alert(1);&lt;/script&gt;' }

It's important to note that, in terms of the document constructed by the call to Nokogiri::HTML.fragment, these are roughly equivalent in terms of the kind of HTML produced --- this may not be obvious at first glance. See this IRB session:

[35] pry(main)> original_content = '<code>'
=> "<code>"

[36] pry(main)> puts Nokogiri::HTML.fragment("<a data-original='#{original_content}'>bar</a>").to_html
<a data-original="&lt;code&gt;">bar</a>
=> nil

[37] pry(main)> original_content = '&lt;script&gt;alert(1);&lt;/script&gt;'
=> "&lt;script&gt;alert(1);&lt;/script&gt;"

[38] pry(main)> puts Nokogiri::HTML.fragment("<a data-original='#{original_content}'>bar</a>").to_html
<a data-original="&lt;script&gt;alert(1);&lt;/script&gt;">bar</a>
=> nil

Note also the logical value of the data-original attribute, in terms of the DOM:

[39] pry(main)> original_content = '<code>'
=> "<code>"

[40] pry(main)> puts Nokogiri::HTML.fragment("<a data-original='#{original_content}'>bar</a>").css("a")[0]["data-original"]
<code>
=> nil

[41] pry(main)> original_content = '&lt;script&gt;alert(1);&lt;/script&gt;'
=> "&lt;script&gt;alert(1);&lt;/script&gt;"

[42] pry(main)> puts Nokogiri::HTML.fragment("<a data-original='#{original_content}'>bar</a>").css("a")[0]["data-original"]
<script>alert(1);</script>
=> nil

In other words, despite the different look of the two definitions, they both represent their payload tags in their raw form, unescaped form, because one layer of entities is unescaped when an HTML attribute is parsed. This can produce misleading results.

In the spec, note that the raw HTML (original_content) was:

(a) inserted into the attribute (resulting in one level of entity unescaping), and
(b) compared for equality at the end against doc.to_html.

If given unescaped entities like <code>, the spec effectively asserts that the data-original (logical) content was inserted directly into the page's HTML. This was the original state of the spec.

On the other hand, if given enscaped entities (like <script>...), the spec effectively asserts that the data-original (logical) content is escaped into the page's HTML, because we are asserting those entity escapes are present in the result from #to_html.

We don't actually want data-original's content to be escaped into the page's HTML --- we want data-original's content to faithfully represent the original content!

I've have modified the spec to instead construct the attribute via Nokogiri's DOM, meaning we can't be tripped up by this. There's now a pair of specs that assert that, whatever the data-original content, it finds its way correctly to the output.

Edited Oct 20, 2025 by Asherah Connor

Clearly delineate text and HTML in Banzai, and use helpers consistently to prevent XSS

Merge request reports