Broken file attachment links due to filename Unicode character composition/decomposition/normalization differences
Summary
If I use certain Unicode characters in the name of a file on my local computer, then attach it to a GitLab issue/comment, then edit the issue/comment, GitLab can no longer find the attachment.
Steps to reproduce
- Create a file with decomposed Unicode in its filename
- for example, open macOS TextEdit, then Save, and type I (capital letter i), then activate the macOS Character Viewer and select "combining dot above"
- I can then view the filename in Terminal and confirm the text is decomposed:
    $ ls -1 *.txt | uni identify | head -3
         cpoint  dec  utf8   html  name (cat)
    'I'  U+0049  73   49     I     LATIN CAPITAL LETTER I (Uppercase_Letter)
    '◌̇'  U+0307  775  cc 87  ̇     COMBINING DOT ABOVE (Nonspacing_Mark)
- Attach that file to a GitLab issue or comment
- the attachment URL uses the decomposed form (2 codepoints: "LATIN CAPITAL LETTER I" followed by "COMBINING DOT ABOVE")
- Edit the issue/comment, copy the text, paste into macOS TextEdit, copy, then paste back into the issue/comment and save
- Click the attachment link
- ❌ the link is broken ("404 The page could not be found or you don't have permission to view it")
- the link has changed to the composed Unicode form (1 codepoint: "LATIN CAPITAL LETTER I WITH DOT ABOVE")
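The mismatch behind the broken link can be reproduced outside GitLab. The following is a minimal Python sketch, not GitLab code: it uses only the standard library, and the helper name `describe` is hypothetical. It shows that the decomposed and composed forms look the same on screen but are different strings with different percent-encoded URLs.

```python
import unicodedata
from urllib.parse import quote

# Decomposed form, as typed in the steps above:
# LATIN CAPITAL LETTER I followed by COMBINING DOT ABOVE.
decomposed = "I\u0307"

# Composed form, as produced after the round trip through TextEdit
# (assumption: the round trip behaves like NFC normalization).
composed = unicodedata.normalize("NFC", decomposed)


def describe(text):
    """List each code point in `text` with its Unicode name (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


print(describe(decomposed))
print(describe(composed))

# The two strings are not equal, so a lookup keyed on one misses the other.
print(decomposed == composed)

# Their percent-encoded forms differ as well.
print(quote(decomposed))
print(quote(composed))
```

A path lookup that compares filenames code point by code point (or byte by byte) treats these as two different files, which is consistent with the 404 above.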
Example Project
smokris/test-upload-filename-unicode#1
What is the current bug behavior?
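Links to file attachments break when the text containing them is copy-pasted through an application that changes the Unicode normalization form of the filename (here, decomposed to composed).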
What is the expected correct behavior?
Links to file attachments should still work after copy-pasting them.
Relevant logs and/or screenshots
(see above)
Output of checks
This bug happens on GitLab.com
Possible fixes
- GitLab could normalize (Unicode NFC) attachment filenames both during upload and during routing, so files can be found regardless of how their filename characters are encoded
- or, when inserting file attachment links into Markdown text, GitLab could output URL-escaped hexadecimal character references (%C4%B0) which should survive copy-pasting better than full Unicode glyphs
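Both ideas can be sketched in a few lines. This is an illustration in Python under stated assumptions, not GitLab's implementation: `UploadStore`, `normalize_filename`, `markdown_link`, and the `/uploads/` base path are all hypothetical names chosen for the example.

```python
import unicodedata
from urllib.parse import quote


def normalize_filename(name: str) -> str:
    """Return the NFC form of a filename (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


class UploadStore:
    """Toy in-memory store that normalizes names on upload and on lookup."""

    def __init__(self):
        self._files = {}

    def save(self, name: str, data: bytes) -> str:
        # Normalize during upload so the stored key has one canonical form.
        key = normalize_filename(name)
        self._files[key] = data
        return key

    def find(self, requested_name: str):
        # Normalize during routing so either form resolves to the same file.
        return self._files.get(normalize_filename(requested_name))


def markdown_link(name: str, base: str = "/uploads/") -> str:
    """Build a Markdown link whose URL is percent-encoded ASCII.

    The base path is an assumption for illustration only.
    """
    return f"[{name}]({base}{quote(normalize_filename(name))})"


store = UploadStore()
store.save("I\u0307.txt", b"hello")          # uploaded in decomposed form

print(store.find("I\u0307.txt") is not None)  # decomposed request
print(store.find("\u0130.txt") is not None)   # composed request
print(markdown_link("I\u0307.txt"))
```

With normalization on both sides, the decomposed and composed requests resolve to the same stored file. The percent-encoded URL contains only ASCII characters, so an editor that renormalizes Unicode text has nothing in the URL to change.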