Copy and paste of multibyte character and combining accent converts character
First off, thank you so much for iTerm2! It's an amazing application.
I have stumbled over some surprising behavior when copying and pasting a multibyte character with a combining accent. The issue seems to me to be quite an edge case, but Terminal.app does not exhibit this behavior, and it caused me some confusion, so I figured I should file an issue.
The issue is that, when the text is copied and pasted, iTerm2 converts a multibyte character followed by a combining accent into the equivalent single precomposed character, so the separate combining accent is lost. I ran into this when writing some code for converting text from an old ASCII-safe encoding of non-ASCII glyphs into a modern Unicode encoding.
The character in question was "ώ", or "GREEK SMALL LETTER OMEGA WITH TONOS". I had represented it as:
But when it is copied into the clipboard, it becomes:
Which looks the same, but fails a string comparison between the two. I had copied the original from iTerm2 and pasted it into my test, causing the expected data to differ from the actual data the code under test produced. I think in part this indicates that I need to change my code to produce the single character. I'm working on this project to learn more about Unicode and multibyte encodings to begin with, so the approach I took to produce the omega with accent could very well be naive or incorrect.
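For anyone else who hits this, here is a minimal sketch of what I believe is going on, using Python's standard `unicodedata` module. The two forms are canonically equivalent in Unicode: the decomposed sequence (omega plus combining acute accent) composes to the single precomposed character under NFC normalization. That iTerm2 applies NFC on copy is my assumption based on the bytes I observed, not something I have confirmed. The variable names are just illustrative.

```python
import unicodedata

# What my code produced: omega followed by a combining acute accent.
decomposed = "\u03c9\u0301"
# What ended up on the clipboard: the single precomposed character.
precomposed = "\u03ce"

# A plain comparison fails because the code point sequences differ.
plain_equal = decomposed == precomposed

# Normalizing both sides to the same form makes the comparison succeed.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# The UTF-8 bytes of each form, as space-separated hex.
decomposed_hex = decomposed.encode("utf-8").hex(" ")
precomposed_hex = precomposed.encode("utf-8").hex(" ")

print(plain_equal, nfc_equal, nfd_equal)
print(decomposed_hex, "|", precomposed_hex)
```

Normalizing both the expected and the actual string to one form before comparing would have made my test insensitive to which representation the terminal hands back.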
Terminal.app preserves the data when copying and pasting, but the behavior is so arcane to begin with that I would not be surprised if it is accepted to be an "implementation specific" aspect of writing a terminal application. Regardless, by filing an issue perhaps someone else can be saved some confusion.
I found these issues which seem possibly related:
But neither of them seems to address it directly. I have tried enabling both the "Treat ambiguous-width characters as double width" and "Use HFS+ Unicode normalization" settings, but neither had an effect.
Detailed steps to reproduce the problem:
- Create a file (file1) containing the multibyte character plus combining mark, hex contents cf 89 cc 81.
- cat the file from step 1 in iTerm2.
- Select the character with the mouse, copy it and then create another file from it:
echo "<cmd+v>" > file2.
- Compare the two files with hexdump:
$ hexdump file1
0000000 cf 89 cc 81 0a
0000005
$ hexdump file2
0000000 cf 8e 0a
0000003
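In case it helps with reproducing, the steps above can be sketched in Python: the snippet writes the bytes from step 1 to a file and then names the code points in each hexdump. The `describe` helper is hypothetical, and the file is written to a temporary directory rather than the working directory. The second byte string is simply the file2 content from the hexdump above, not something the script obtains from iTerm2.

```python
import tempfile
import unicodedata
from pathlib import Path


def describe(data: bytes) -> list[str]:
    """Return 'U+XXXX NAME' for each code point in UTF-8 encoded data."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in data.decode("utf-8")
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Step 1: omega, combining acute accent, and a trailing newline.
    file1 = Path(tmp) / "file1"
    file1.write_bytes(bytes.fromhex("cf 89 cc 81 0a"))
    original = describe(file1.read_bytes())

# Bytes observed in file2 after copying from iTerm2 and pasting.
pasted = describe(bytes.fromhex("cf 8e 0a"))

print(original)
print(pasted)
```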