Blobs and diffs should display non-utf8 data correctly in the browser

Summary

Spotted in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/32862#note_216167183

Whether external MR diffs are enabled or not, the diff for a merge request is not displayed correctly when the underlying files being diffed do not contain UTF-8-compatible data.

For SHIFT-JIS, we end up with a mojibake display on GitLab.com today (diffs stored in-database). If we enabled external MR diffs, then we might see an exception instead.

https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/32862 will "fix" the external diff case so we also see mojibake, rather than a 500, but that's still not amazing.

Steps to reproduce

Create an MR that diffs a file stored in Shift-JIS, a Cyrillic encoding, or any other encoding that is not UTF-8-compatible.
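If it helps with reproduction, here is a minimal sketch for producing such a file locally before committing it. The helper name `write_encoded_file`, the file name, and the sample text are all made up for illustration; they are not taken from the example project.

```python
from pathlib import Path


def write_encoded_file(path: str, text: str, encoding: str) -> bytes:
    """Encode `text` with `encoding`, write the raw bytes to `path`, return them.

    Hypothetical helper for building a reproduction case; writing bytes
    directly avoids any implicit UTF-8 conversion by the editor or shell.
    """
    raw = text.encode(encoding)
    Path(path).write_bytes(raw)
    return raw
```

For example, `write_encoded_file("sample.md", "あいう\n", "shift_jis")` writes three hiragana plus a newline as Shift-JIS bytes. Commit the file, change it on a branch, and open an MR to get a diff over non-UTF-8 data.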

Example Project

hiroponz/non-utf8-encoding-test!1 (diffs)

https://gitlab.com/hiroponz/non-utf8-encoding-test/blob/4f7ddac76f2bdfb94b03069adb02fee40d5c135c/%E3%83%86%E3%82%B9%E3%83%88.md

What is the current bug behavior?

Mojibake. Note that the file contains seven bytes, but only three characters are shown, all of them plain ASCII.

What is the expected correct behavior?

We should render the 3 hiragana characters, regardless of how the diff is stored / cached / highlighted.

Relevant logs and/or screenshots

Screenshot_from_2019-09-12_13-48-03

Screenshot_from_2019-09-12_13-48-18

Screenshot_from_2019-09-12_13-48-41

Output of checks

This bug happens on GitLab.com

Possible fixes

I expect we need to fix this by transcoding the non-UTF-8-encoded parts into the UTF-8 representation of the same characters from the source encoding. I don't think we can reasonably pass the source encoding through to the browser.

Whatever we do, when downloading the raw file, diff, patch, commit, etc., we should always retain the original encoding. I think this is the case at present.
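A rough sketch of the idea, in Python rather than GitLab's own code: decode the stored bytes to text for display only, and leave the stored bytes untouched for raw downloads. The function name `to_display_text` and the candidate list are assumptions for illustration. A real fix would presumably use a proper encoding detector instead of trying a fixed list, since guessing in a fixed order can misidentify the encoding.

```python
from typing import Sequence, Tuple

# Hypothetical, ordered list of encodings to try when the data is not UTF-8.
CANDIDATE_ENCODINGS = ("shift_jis", "euc_jp", "cp1251")


def to_display_text(
    raw: bytes, candidates: Sequence[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, source_encoding) suitable for rendering as UTF-8.

    `raw` is never modified, so raw/patch downloads can keep serving the
    original bytes in the original encoding.
    """
    # Valid UTF-8 (including plain ASCII) needs no transcoding.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Otherwise try each candidate in order; the first clean decode wins.
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # Last resort: keep what is decodable and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "unknown"
```

With a made-up seven-byte Shift-JIS input such as `b"\x82\xa0\x82\xa2\x82\xa4\n"`, this yields three hiragana and a newline, which the view layer can then emit as UTF-8. Re-encoding the returned text with the returned encoding gives back the original bytes, so the display path and the download path stay consistent.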

cc @m_gill @stanhu @hiroponz
