Broken file attachment links due to filename Unicode character composition/decomposition/normalization differences


Summary

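If I use decomposed Unicode characters (a base letter followed by a combining mark) in the name of a file on my local computer, attach that file to a GitLab issue/comment, and then edit the issue/comment so that the filename is re-normalized to its composed form, GitLab can no longer find the attachment.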

Steps to reproduce

  1. Create a file with decomposed Unicode in its filename
    • for example, open macOS TextEdit, then Save, and type I (capital letter i), then activate the macOS Character Viewer and select "combining dot above"
    • I can then view the filename in Terminal and confirm the text is decomposed:
      $ ls -1 *.txt | uni identify | head -3
           cpoint  dec    utf8        html       name (cat)
      'I'  U+0049  73     49          &#x49;     LATIN CAPITAL LETTER I (Uppercase_Letter)
      '◌̇'  U+0307  775    cc 87       &#x307;    COMBINING DOT ABOVE (Nonspacing_Mark)
  2. Attach that file to a GitLab issue or comment
    • the attachment URL uses the decomposed form (2 codepoints: "LATIN CAPITAL LETTER I" followed by "COMBINING DOT ABOVE")
  3. Edit the issue/comment, copy its text, paste it into macOS TextEdit, copy it again from TextEdit, then paste it back into the issue/comment and save
  4. Click the attachment link
    • the link is broken ("404 The page could not be found or you don't have permission to view it")
    • the link has changed to the composed Unicode form (1 codepoint: "LATIN CAPITAL LETTER I WITH DOT ABOVE")
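
The mismatch in the steps above can be reproduced outside GitLab. This is a minimal sketch, not GitLab code: the filename `İ.txt` is an assumed example (the report only says the name contains the letter followed by the combining dot), and NFC normalization stands in for whatever the TextEdit round trip does to the text.

```python
import unicodedata
from urllib.parse import quote

# Assumed example filename: "I" + COMBINING DOT ABOVE, as created in step 1.
decomposed = "I\u0307.txt"
# NFC composes the pair into U+0130, the form the link has after step 3.
composed = unicodedata.normalize("NFC", decomposed)

# Both names render the same, but they are different strings.
print([f"U+{ord(c):04X}" for c in decomposed[:2]])  # ['U+0049', 'U+0307']
print([f"U+{ord(c):04X}" for c in composed[:1]])    # ['U+0130']
print(decomposed == composed)                       # False

# The percent-encoded paths differ too, so an exact-match lookup
# for one form will not find a file stored under the other.
print(quote(decomposed))  # I%CC%87.txt
print(quote(composed))    # %C4%B0.txt
```

If the server stores the upload under the first path and the edited Markdown requests the second, a 404 is the expected result of a byte-for-byte comparison.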

Example Project

smokris/test-upload-filename-unicode#1

What is the current bug behavior?

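Links to file attachments break when the text containing them is copied and pasted through an application that changes the Unicode normalization form of the filename.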

What is the expected correct behavior?

Links to file attachments should still work after being copied and pasted, regardless of whether the filename arrives in composed or decomposed form.

Relevant logs and/or screenshots

(see above)

Output of checks

This bug happens on GitLab.com

Possible fixes

  1. GitLab could normalize attachment filenames to Unicode NFC both on upload and when routing requests, so a file can be found regardless of which normalization form the requested filename uses
  2. or, when inserting file attachment links into Markdown text, GitLab could output percent-encoded (URL-escaped) UTF-8 bytes (%C4%B0), which should survive copy-pasting better than literal Unicode characters