Encoding::CompatibilityError in Rouge lexer when highlighting diffs with ASCII-8BIT content

Summary

An Encoding::CompatibilityError occurs when viewing commits or merge requests containing files with ASCII-8BIT encoded content (such as PDFs diffed as text). The error is triggered when Rouge's lexer guesser attempts to match UTF-8 regular expressions against ASCII-8BIT strings.

Sentry Error: https://new-sentry.gitlab.net/organizations/gitlab/issues/3179853

Error Details

Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string)
  from rouge/guessers/util.rb:13:in `sub'
  from rouge/guessers/modeline.rb:32:in `filter'
  from rouge/lexer.rb:185:in `guess'
  from lib/gitlab/highlight.rb:39:in `lexer'

Root Cause

  1. Rapid Diffs calls whitespace_only? to decide how to render the diff, which triggers syntax highlighting
  2. Gitlab::Highlight#lexer calls Rouge::Lexer.guess(source: @blob_content)
  3. @blob_content may arrive from Gitaly tagged as ASCII-8BIT for files with binary-like content
  4. Rouge's modeline guesser runs UTF-8 regexps over that source (the sub call in rouge/guessers/util.rb), which raises Encoding::CompatibilityError when the ASCII-8BIT string contains non-ASCII bytes
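
The failure in step 4 can be reproduced in plain Ruby without Rouge or GitLab. The sketch below is illustrative only: the byte string is made up, and the pattern is a stand-in for whatever regexp Rouge applies, chosen because a \u escape pins a regexp to UTF-8.

```ruby
# Made-up bytes resembling the start of a PDF; String#b returns an ASCII-8BIT copy.
binary = "%PDF-1.4\n\xE2\xE3\xCF\xD3\n".b

# Stand-in pattern (an assumption, not Rouge's actual regexp).
# The \u escape fixes the regexp's encoding to UTF-8.
utf8_pattern = /\A\uFEFF/

# Matching a UTF-8 regexp against an ASCII-8BIT string with non-ASCII bytes raises.
error = begin
  binary.sub(utf8_pattern, '')
  nil
rescue Encoding::CompatibilityError => e
  e
end

# An ASCII-8BIT string holding only 7-bit bytes matches without error.
ascii_only = "plain text\n".b
ascii_result = ascii_only.sub(utf8_pattern, '')

puts error.class
puts ascii_result.encoding
```

The second half suggests why the error only shows up for some blobs: an ASCII-8BIT string that happens to contain only 7-bit bytes is still compatible with a UTF-8 regexp.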

Example

Commit with PDF metadata changes: pawel-kow/documentation@0180f15c

Proposed Fix

In lib/gitlab/highlight.rb, relabel the content as UTF-8 and replace any invalid bytes before passing it to Rouge:

def lexer
  @lexer ||= custom_language || begin
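    # Work on a copy so @blob_content keeps its original encoding tag.
    source = @blob_content.to_s.dup.force_encoding(Encoding::UTF_8)
    # Replace bytes that are not valid UTF-8 so Rouge's UTF-8 regexps can match.
    source = source.encode(Encoding::UTF_8, invalid: :replace, undef: :replace) unless source.valid_encoding?
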
    Rouge::Lexer.guess(filename: @blob_name, source: source).new
  rescue Rouge::Guesser::Ambiguous => e
    e.alternatives.min_by(&:tag)
  end
end
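
The normalization step can be checked in isolation, outside Gitlab::Highlight. The following is a minimal sketch, not the actual patch: utf8_for_guessing is a hypothetical helper name, the sample bytes are made up, and String#scrub is swapped in for the encode(..., invalid: :replace) call as a simpler way to replace invalid bytes.

```ruby
# Hypothetical helper (not part of GitLab) that mirrors the normalization above.
def utf8_for_guessing(content)
  # dup so the caller's string keeps its original encoding tag
  source = content.to_s.dup.force_encoding(Encoding::UTF_8)
  return source if source.valid_encoding?

  # String#scrub replaces each invalid byte sequence with U+FFFD
  source.scrub
end

# Made-up binary-like content tagged ASCII-8BIT, as in the root cause above.
raw = "%PDF-1.4\n\xE2\xE3\xCF\xD3\n".b
clean = utf8_for_guessing(raw)

puts clean.encoding
puts clean.valid_encoding?

# A UTF-8 regexp (illustrative stand-in) now matches without raising.
stripped = clean.sub(/\A\uFEFF/, '')
```

Replacing invalid bytes is lossy, which seems acceptable here because the string is only used to guess a lexer; the original blob content is left untouched.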