NFC Normalization of Å

This is an issue in the PUCU library. Copied here for reference.

https://github.com/BeRo1985/pucu/issues/8


Decomposing the letter Å (Latin Capital Letter A with Ring Above, code point $00C5) produces the sequence $0041 $030A. This is correct. However, composing the sequence $0041 $030A produces the code point $212B (Angstrom Sign) instead of $00C5.

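$00C5 and $212B are canonically equivalent code points, but the normalized form of both is $00C5, so the composition is wrong. $212B has a singleton canonical decomposition to $00C5, and singleton decompositions are excluded from composition, so NFC should never produce $212B.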
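The relationship between the two code points can be checked against the Unicode Character Database. The following is a minimal sketch that uses Python's standard `unicodedata` module as an independent reference. It does not use PUCU's API, and the constant names are illustrative only.

```python
import unicodedata

# Illustrative names, not taken from PUCU.
A_WITH_RING = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM_SIGN = "\u212B"  # ANGSTROM SIGN

# Canonical decomposition mappings as recorded in the Unicode Character Database.
ring_mapping = unicodedata.decomposition(A_WITH_RING)
angstrom_mapping = unicodedata.decomposition(ANGSTROM_SIGN)

print(ring_mapping)      # 0041 030A
print(angstrom_mapping)  # 00C5  (a singleton: it maps to one other code point)
```

Because the mapping for $212B is a singleton, a composer that builds its composition table by simply inverting every decomposition mapping can end up with entries that lead to $212B. Whether that is what happens inside PUCU is an assumption, not something the report confirms.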

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
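The expected round trip described above can be reproduced with a reference implementation. This is a sketch using Python's standard `unicodedata.normalize`, and the helper names `code_points` and `normalize_round_trip` are hypothetical. It shows the behavior a conforming normalizer should have, not PUCU's own code.

```python
import unicodedata

def code_points(s):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

def normalize_round_trip(s):
    """Return (NFD form of s, NFC form of that NFD result)."""
    decomposed = unicodedata.normalize("NFD", s)
    recomposed = unicodedata.normalize("NFC", decomposed)
    return decomposed, recomposed

for original in ("\u00C5", "\u212B"):
    decomposed, recomposed = normalize_round_trip(original)
    print(code_points(original), "->", code_points(decomposed), "->", code_points(recomposed))
# U+00C5 -> U+0041 U+030A -> U+00C5
# U+212B -> U+0041 U+030A -> U+00C5
```

Both inputs end at U+00C5. A composition step that returns U+212B for the sequence U+0041 U+030A fails this check.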

This is a pretty big problem, as Å is a fairly common letter in Scandinavian languages (which is how I discovered it).