Encoding::CompatibilityError in Rouge lexer when highlighting diffs with ASCII-8BIT content
Summary
An Encoding::CompatibilityError occurs when viewing commits or merge requests containing files with ASCII-8BIT encoded content (such as PDFs diffed as text). The error is triggered when Rouge's lexer guesser attempts to match UTF-8 regular expressions against ASCII-8BIT strings.
Sentry Error: https://new-sentry.gitlab.net/organizations/gitlab/issues/3179853
Error Details
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
from rouge/guessers/util.rb:13:in `sub'
from rouge/guessers/modeline.rb:32:in `filter'
from rouge/lexer.rb:185:in `guess'
from lib/gitlab/highlight.rb:39:in `lexer'
Root Cause
- Rapid Diffs calls whitespace_only? to determine rendering, which triggers syntax highlighting
- Gitlab::Highlight#lexer calls Rouge::Lexer.guess(source: @blob_content)
- @blob_content may be ASCII-8BIT encoded from Gitaly for files with binary-like content
- Rouge's modeline guesser uses UTF-8 regexps, causing the encoding error
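The mismatch can be reproduced in plain Ruby without Rouge or Gitaly. The following is a minimal sketch, not the actual Rouge code: bom_re is a stand-in for a guesser regexp that contains a non-ASCII character, binary_blob is a stand-in for PDF-like content as it might arrive from Gitaly, and match_error is a hypothetical helper used only to capture the exception.

```ruby
# Stand-in for a guesser regexp. The \u escape makes this a
# fixed-encoding UTF-8 regexp (assumption: not the literal Rouge source).
bom_re = /\A\uFEFF/

# Stand-in for blob content: binary-tagged bytes, some of them above 0x7F.
binary_blob = "%PDF-1.4\n\xE2\xE3\xCF\xD3\n".b

# Hypothetical helper: returns the raised error, or nil if the match ran.
def match_error(string, regexp)
  string.sub(regexp, '')
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Binary string with high bytes: the UTF-8 regexp cannot be applied.
binary_error = match_error(binary_blob, bom_re)

# Binary string with only 7-bit ASCII bytes: the same regexp works.
ascii_error = match_error("plain ascii".b, bom_re)

puts binary_error.class
puts ascii_error.inspect
```

This would explain why the error only shows up for some blobs: a binary-tagged string that happens to be pure ASCII still matches, while one containing bytes above 0x7F does not.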
Example
Commit with PDF metadata changes: pawel-kow/documentation@0180f15c
Proposed Fix
In lib/gitlab/highlight.rb, relabel the content as UTF-8 and replace any invalid byte sequences before passing it to Rouge:
def lexer
  @lexer ||= custom_language || begin
    # Work on a copy so @blob_content keeps its original encoding.
    source = @blob_content.to_s.dup.force_encoding(Encoding::UTF_8)
    # Replace byte sequences that are not valid UTF-8.
    source = source.encode(Encoding::UTF_8, invalid: :replace, undef: :replace) unless source.valid_encoding?
    Rouge::Lexer.guess(filename: @blob_name, source: source).new
  rescue Rouge::Guesser::Ambiguous => e
    e.alternatives.min_by(&:tag)
  end
end
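The normalization step can also be exercised in isolation. The sketch below pulls it into a hypothetical helper, utf8_for_guessing, which does not exist in GitLab; it also uses String#scrub in place of the encode call above, which should behave the same way for this purpose but is a substitution, not the proposed patch itself. The regexp is again a stand-in for one of Rouge's UTF-8 regexps.

```ruby
# Hypothetical helper (not part of GitLab): returns a valid UTF-8 copy
# of the given content without mutating the original string.
def utf8_for_guessing(content)
  source = content.to_s.dup.force_encoding(Encoding::UTF_8)
  # String#scrub replaces invalid byte sequences with U+FFFD.
  source.valid_encoding? ? source : source.scrub
end

# Stand-in for PDF-like blob content tagged as binary.
blob = "%PDF-1.4\n\xE2\xE3\xCF\xD3\n".b
source = utf8_for_guessing(blob)

# A UTF-8 regexp with a non-ASCII character can now be applied safely.
stripped = source.sub(/\A\uFEFF/, '')

puts source.encoding == Encoding::UTF_8
puts source.valid_encoding?
puts blob.encoding == Encoding::BINARY
```

Because the helper duplicates the string first, the original blob keeps its binary tag, so other callers that rely on the raw bytes are unaffected.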