Fix and possible simplification of SanitiseXMLString (DEBUG_NODE_XML thing).
Verbose.SanitiseXMLString falls back to ASCII on any string with UTF-8 sequences longer than two bytes (no, really), whereas it is supposed to do so only if the string contains invalid UTF-8 sequences. Its inner loop, while X > 0 do ...handle Result[X]..., starts from the byte before last (that’s why two-byte sequences work) and never changes X or Result, so it eventually concludes that the sequence was too long.
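The failure mode can be modelled outside the compiler. The Python sketch below is not the code from Verbose.SanitiseXMLString: trailing_seq_len, MAX_SEQ_LEN and the advance flag are hypothetical names, and the real routine may count or bail out differently. It only illustrates why a backward scan that never moves its index copes with two-byte sequences but reports three-byte ones as too long.

```python
MAX_SEQ_LEN = 4  # longest UTF-8 sequence; assumed limit for this sketch


def is_continuation(b: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (b & 0xC0) == 0x80


def trailing_seq_len(data: bytes, advance: bool = True):
    """Length of the UTF-8 sequence that ends at the last byte of data.

    Returns None when the scan decides the sequence is too long or never
    finds a lead byte. advance=False models a loop body that forgets to
    decrement its index (hypothetical stand-in for the missing dec(X)).
    """
    if not data:
        return 0
    if not is_continuation(data[-1]):
        return 1  # single byte, nothing to walk back over
    x = len(data) - 2  # start from the byte before last
    length = 2
    while x >= 0:
        if not is_continuation(data[x]):
            return length  # reached the lead byte
        length += 1
        if length > MAX_SEQ_LEN:
            return None  # "too long"
        if advance:
            x -= 1  # the step that makes the scan actually move
    return None  # ran off the front without finding a lead byte


two_byte = b"\xc3\xa9"        # U+00E9
three_byte = b"\xe2\x80\x94"  # U+2014, the em dash from this report

print(trailing_seq_len(two_byte, advance=False))    # lead byte found at once
print(trailing_seq_len(three_byte, advance=False))  # same byte re-read until the limit trips
print(trailing_seq_len(three_byte, advance=True))   # lead byte found after one step back
```

With advance=False the two-byte case still succeeds because the byte before last is already the lead byte, which matches the behaviour described above.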
I have two versions of the fix.
sxs-v1.patch
This one just adds the missing dec(X).
sxs-v2.patch
This one throws away most of the manual work with UTF-8 and replaces it with Utf8CodepointLen, saving 80 lines of code. There is a downside, however: Utf8CodepointLen accepts overlong sequences (for example, the null character encoded as 11000000 10000000), and thus is not entirely reliable for validating UTF-8. I think I’ll fix that separately.
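To make the overlong-sequence caveat concrete, here is a small Python sketch. It does not reproduce Utf8CodepointLen; lead_byte_len and structurally_valid are hypothetical helpers that check only what the lead byte and the continuation-byte pattern imply, which is the kind of check that lets an overlong encoding through. Python's built-in strict decoder is used as the reference for well-formed UTF-8.

```python
def lead_byte_len(b: int) -> int:
    """Sequence length implied by the lead byte alone; 0 if b cannot start a sequence."""
    if b < 0x80:
        return 1
    if 0xC0 <= b <= 0xDF:
        return 2
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF7:
        return 4
    return 0


def structurally_valid(data: bytes) -> bool:
    """Length-and-shape check only: no overlong or range checks (assumed model)."""
    i = 0
    while i < len(data):
        n = lead_byte_len(data[i])
        if n == 0 or i + n > len(data):
            return False
        # every following byte must look like 10xxxxxx
        if any((c & 0xC0) != 0x80 for c in data[i + 1:i + n]):
            return False
        i += n
    return True


def strictly_valid(data: bytes) -> bool:
    """Well-formedness as judged by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_nul = bytes([0b11000000, 0b10000000])  # the example from this report
em_dash = b"\xe2\x80\x94"

print(structurally_valid(overlong_nul), strictly_valid(overlong_nul))
print(structurally_valid(em_dash), strictly_valid(em_dash))
```

The overlong pair passes the shape check but is rejected by the strict decoder, while the em dash passes both, so a length-based validator would need an extra minimum-value check per sequence length to close the gap.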
In either case, the result is that users of Asian scripts encoded in three bytes, or just non-English lovers of 3-byte characters in their string constants, like me with my em dashes (“—” is E2 80 94), will get fewer corrupted strings during their adventures inside the compiler codebase.