Support NFD combining accents
By egm... on October 04, 2010 22:18 (imported from Google Code)
NFC means precomposed Unicode characters, where e.g. the letter "á" (a with acute accent) has its own codepoint, in this particular case U+00E1 (UTF-8: 0xC3 0xA1).
NFD is when an accented letter is composed as the unaccented letter (e.g. "a") followed by a combining accent (the combining acute accent is U+0301, UTF-8: 0xCC 0x81). The whole accented letter "á" then becomes U+0061 U+0301 in Unicode (yes, it's a sequence of two codepoints), that is, 0x61 0xCC 0x81 in UTF-8.
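For anyone who wants to check the two encodings, here is a small Python sketch (Python is only used for illustration; it has nothing to do with iTerm's code). The variable names are made up for this example.

```python
import unicodedata

nfc = "\u00e1"     # precomposed "á": one codepoint
nfd = "a\u0301"    # "a" followed by COMBINING ACUTE ACCENT: two codepoints

# Normalization converts between the two forms.
assert unicodedata.normalize("NFD", nfc) == nfd
assert unicodedata.normalize("NFC", nfd) == nfc

# UTF-8 bytes of each form, as hex strings.
nfc_bytes = nfc.encode("utf-8").hex()   # "c3a1"
nfd_bytes = nfd.encode("utf-8").hex()   # "61cc81"
print(nfc_bytes, nfd_bytes)
```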
NFC is much more commonly used: it is the standard encoding on the web, it is used by all Linux systems, and even on Mac the keyboard produces such accented letters in this form, so file contents usually contain this encoding. Also, precomposed glyphs usually look nicer than two separate glyphs drawn on top of each other.
NFD, on the other hand, is more flexible: you can place any accent (or even several of them) on any letter. Mac OS's filesystem layer uses NFD and automatically converts filenames from NFC. Just type the command "touch á" from your keyboard: the touch command receives its argument in NFC, but a subsequent "ls" will report the filename back in NFD.
iTerm does not support NFD combining accents properly. The most prominent example is probably that filenames containing accents might not show up properly when doing an "ls".
When printing a combining accent, the cursor should not advance, and the accent should be drawn on the preceding character. This should be the case even if there's a pause between printing the base letter and the accent, e.g. this command:
echo -n a; sleep 1; echo $'\xCC\x81bcd'
should print an "a", and a second later it should place an accent on top of this and continue (without leaving a space) with the rest of the letters, forming the "ábcd" string.
Characters that don't have an NFC counterpart should be supported too, e.g.
echo -n $'q\xCC\x81rst'
should print "qrst" with an accent on the "q", even though there's no such NFC character.
Multiple accents should be supported too, e.g.
echo -n $'q\xCC\x80\xCC\x81rst'
should put both a grave accent and an acute accent on top of "q".
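The expected behavior in the three cases above can be sketched with a toy screen row in Python. This is only a model of the rule "a combining accent attaches to the preceding cell and does not move the cursor"; the `Row` class and its fields are hypothetical, it ignores line wrapping and double-width characters, and it treats the Unicode categories Mn and Me as "combining", which is a simplification.

```python
import unicodedata

class Row:
    """One terminal row; each cell holds a base character plus its accents."""

    def __init__(self, width=80):
        self.cells = [""] * width
        self.cursor = 0

    def write(self, text):
        for ch in text:
            is_mark = unicodedata.category(ch) in ("Mn", "Me")
            if is_mark and self.cursor > 0:
                # Combining accent: attach to the previous cell, cursor stays.
                self.cells[self.cursor - 1] += ch
            elif self.cursor < len(self.cells):
                # Base character (or a stray accent at column 0): new cell.
                self.cells[self.cursor] = ch
                self.cursor += 1

    def text(self):
        return "".join(self.cells)

# The accent arrives in a later write, as with the "sleep 1" example.
delayed = Row()
delayed.write("a")
delayed.write("\u0301bcd")

# Two accents on a letter that has no precomposed form.
stacked = Row()
stacked.write("q\u0300\u0301rst")
```

Because the accents are kept per cell, it makes no difference whether they arrive together with the base letter or in a later write.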
When copy-pasting, the NFC or NFD property of the original string should preferably be preserved.
It seems to me that NSString's initWithBytes() performs an NFD->NFC conversion whenever it can, that is, whenever the character exists in NFC and it receives both the base letter and the combining accent in the same run. Prior to r202, or if the bits causing issue #200 (closed) get reverted, ASCII and 8-bit characters are passed into different NSString initWithBytes() runs, so they never get combined back to NFC. With the current handling, a combining accent is printed on top of the preceding character, but then a cell is skipped (the accented letter is visually followed by a space), which is undesired.
In r202, ASCII and 8-bit characters are usually processed by a single NSString initWithBytes() call, so they get combined to NFC if such an NFC character exists. However, they will not get composed if there's no such NFC character, or if the application just happens to take a short break in the middle (e.g. its standard buffer fills up right after the base letter, it flushes the buffer, and later continues with the combining accent).
I believe NSString initWithBytes()'s behavior of converting NFD to NFC is undesirable (you lose NFDness when copy-pasting, and you have different behavior if there's a sleep in between). I think we should remember the original sequence of Unicode codepoints.
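To illustrate why composing per run is fragile, here is a Python sketch that composes each chunk of input separately. Python's `unicodedata.normalize` is only a stand-in for whatever NSString does internally (that part is my assumption, as described above), and `compose_per_chunk` is a made-up helper.

```python
import unicodedata

def compose_per_chunk(chunks):
    """Decode each chunk and NFC-compose it on its own, one call per read."""
    return "".join(
        unicodedata.normalize("NFC", chunk.decode("utf-8")) for chunk in chunks
    )

# Base letter and accent arrive in the same read: they get composed.
one_read = compose_per_chunk([b"a\xcc\x81bcd"])

# The application flushed between the letter and the accent: no composition.
two_reads = compose_per_chunk([b"a", b"\xcc\x81bcd"])

# There is no precomposed "q with acute", so this stays decomposed either way.
no_nfc = compose_per_chunk([b"q\xcc\x81rst"])

print(len(one_read), len(two_reads), len(no_nfc))
```

The same bytes end up as different codepoint sequences depending only on where the reads happen to be split, which is the inconsistency I'd like to avoid.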
I understand that supporting an arbitrary number of accents on top of each cell might be a memory killer. Safeguards would be needed in various places, and even with an arbitrary limit (e.g. no more than N combining accents per character), reserving room for them in every cell would be a total waste of memory, so this is not the way to go.
Currently, as far as I can tell (I might be wrong), each physical line is stored in an array, with every cell corresponding to one index in the array (and continuation cells of double-width characters containing some special value). If NFD support gets implemented by storing all these accents in this linear array, we'd no longer have a 1:1 mapping from array indices to cell positions, which might cause some trouble, but might actually work out.
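A rough Python sketch of what losing the 1:1 mapping would mean: with accents stored inline, finding the cell for an array index needs a scan over the line. The function name is hypothetical, double-width characters are ignored, and Unicode categories Mn and Me are used as a simplified test for "combining accent".

```python
import unicodedata

def columns_for_indices(line):
    """For each index of a flat codepoint array, return the cell column used."""
    cols = []
    col = -1
    for ch in line:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if not is_mark or col < 0:
            col += 1       # a base character starts a new cell
        cols.append(col)   # a combining accent shares the previous cell
    return cols

single = columns_for_indices("a\u0301bcd")
double = columns_for_indices("q\u0300\u0301rst")
print(single, double)
```

Every lookup from index to column (or back) becomes linear in the line length unless some extra index is kept alongside the array.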
VTE uses something like a "color palette": it remembers all the different "base letter + combining accents" sequences it ever sees, and assigns them code values starting from 0x80000000, which is outside the Unicode range. There are some safeguards to make sure the table cannot grow infinitely (that is, cat'ing /dev/urandom or a maliciously crafted file doesn't cause OOM; rather, the terminal starts misbehaving by intentionally dropping some accents), and there's probably some garbage collection too, to get rid of sequences that are no longer used.
I have no idea how other terminal emulators implement NFD characters. It's certainly not a trivial question.
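The palette idea can be sketched in Python like this. It is not VTE's actual code: the class, its method names and the size limit are all hypothetical, and garbage collection of unused entries is left out. The point is only that each cell keeps storing a single integer.

```python
class SequencePalette:
    """Maps "base letter + accents" sequences to codes from 0x80000000 up."""

    FIRST_CODE = 0x80000000

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries   # safeguard against unbounded growth
        self.code_for = {}               # sequence -> code
        self.seq_for = []                # (code - FIRST_CODE) -> sequence

    def intern(self, sequence):
        """Return the code for `sequence`, or None if the table is full."""
        code = self.code_for.get(sequence)
        if code is None:
            if len(self.seq_for) >= self.max_entries:
                return None              # caller may then drop the accents
            code = self.FIRST_CODE + len(self.seq_for)
            self.code_for[sequence] = code
            self.seq_for.append(sequence)
        return code

    def lookup(self, code):
        """Return the sequence that `code` stands for."""
        return self.seq_for[code - self.FIRST_CODE]

palette = SequencePalette(max_entries=2)
first = palette.intern("q\u0301")
again = palette.intern("q\u0301")          # same sequence, same code
second = palette.intern("q\u0300\u0301")
overflow = palette.intern("x\u0301")       # table is full
```

When the table is full, the caller would fall back to storing just the base letter, which matches the "intentionally dropping some accents" behavior described above.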