Fix and possible simplification of SanitiseXMLString (DEBUG_NODE_XML thing). (#39800)

`Verbose.SanitiseXMLString` falls back to ASCII on any string containing UTF-8 sequences **longer than two bytes** (no, really), while it is supposed to do so only if the string contains **invalid** UTF-8 sequences. Its [inner loop](https://gitlab.com/freepascal.org/fpc/source/-/blob/b2a5334a7594238d83b84144e41cb0e37d8fc1c9/compiler/verbose.pas#L1256) `while X > 0 do ...handle Result[X]...` starts from the byte before last (that’s why two-byte sequences work) and changes neither `X` nor `Result`, so it ends up concluding that the sequence was too long.

I have two versions of the fix.

- [sxs-v1.patch](/uploads/03f67918a09d1d1cb569072e8366c5c7/sxs-v1.patch): this one just adds the missing `dec(X)`.
- [sxs-v2.patch](/uploads/d71132af51b8371339f3e64b0c1632a9/sxs-v2.patch): this one throws away most of the manual UTF-8 handling and replaces it with `Utf8CodepointLen`, **saving 80 lines of code**. There is a downside, however: **`Utf8CodepointLen` accepts overlong sequences** (for example, the null character encoded as `11000000 10000000`), so it is not entirely reliable for validating UTF-8. I think I’ll fix that separately.

In either case, the result is that people whose scripts encode to 3 bytes per character (many Asian scripts, for instance), or just non-English lovers of 3-byte characters in their string constants, like me with my em dashes (“—” is `E2 80 94`), will get fewer corrupted strings during their adventures inside the compiler codebase.

Before:

![before](/uploads/1882ae054c2a8d92e153a02727ceb912/before.png)

After:

![after](/uploads/dd9af1f3e855c8a0e873c41cacbb2ed0/after.png)
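
For reference, here is a minimal sketch of what a strict check would have to reject, written in Python purely as an illustration. It is not the code from either patch and not how `Utf8CodepointLen` is implemented; the names `strict_utf8_len` and `is_valid_utf8` are hypothetical. The point is the `min_cp` comparison: a sequence is overlong when the decoded code point would have fit into a shorter encoding.

```python
def strict_utf8_len(data: bytes, pos: int = 0) -> int:
    """Length of the well-formed UTF-8 sequence starting at data[pos], or 0 if invalid.

    Hypothetical helper for illustration only.
    """
    lead = data[pos]
    if lead < 0x80:
        return 1  # plain ASCII
    if lead & 0xE0 == 0xC0:
        length, cp, min_cp = 2, lead & 0x1F, 0x80
    elif lead & 0xF0 == 0xE0:
        length, cp, min_cp = 3, lead & 0x0F, 0x800
    elif lead & 0xF8 == 0xF0:
        length, cp, min_cp = 4, lead & 0x07, 0x10000
    else:
        return 0  # stray continuation byte or an impossible lead byte
    if pos + length > len(data):
        return 0  # sequence is cut off by the end of the string
    for b in data[pos + 1 : pos + length]:
        if b & 0xC0 != 0x80:
            return 0  # not a continuation byte
        cp = (cp << 6) | (b & 0x3F)
    if cp < min_cp:
        return 0  # overlong: the code point fits into a shorter sequence
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return 0  # beyond Unicode range, or a UTF-16 surrogate
    return length


def is_valid_utf8(data: bytes) -> bool:
    """True if every sequence in data is well-formed under the checks above."""
    pos = 0
    while pos < len(data):
        n = strict_utf8_len(data, pos)
        if n == 0:
            return False
        pos += n  # always advance, otherwise the loop never makes progress
    return True
```

With this sketch, the em dash `E2 80 94` is accepted as a 3-byte sequence, while the overlong null `C0 80` is rejected:

```python
print(strict_utf8_len(b"\xE2\x80\x94"))  # 3
print(is_valid_utf8(b"\xC0\x80"))        # False
```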
